A Survey of Sketches in Traffic Measurement: Design, Optimization, Application and Implementation

Shangsen Li, Lailong Luo, Deke Guo, Qianzhen Zhang, Pengtao Fu Science and Technology on Information System Engineering Laboratory, National University of Defense Technology, Changsha Hunan 410073, P.R. China.

arXiv:2012.07214v2 [cs.DS] 20 Jul 2021

Abstract—Network measurement probes the underlying network to support upper-level decisions such as network management, network update, network maintenance, network defense and beyond. Due to the massive, speedy, and unpredictable nature of network flows, sketches are widely implemented in measurement nodes to approximately record the frequency or estimate the cardinality of flows. At their core, sketches usually maintain one or multiple counter arrays and rely on hash functions to select the counter(s) for each flow. The space-efficient sketches from the distributed measurement nodes are then aggregated to provide statistics of the ongoing flows. Recently, tremendous redesigns and optimizations have been proposed to improve sketches for better network measurement performance. However, existing reviews and surveys mainly focus on one particular aspect of measurement tasks. Researchers and engineers in the network measurement community desire an all-in-one survey that covers the entire processing pipeline of sketch-based network measurement. To this end, we present the first comprehensive survey of this area. We first introduce the preparation of flows for measurement, then detail the most recent investigations of the design, aggregation, decoding, application and implementation of sketches for network measurement. To summarize the existing efforts, we carry out an in-depth study of the existing literature, covering more than 90 sketch designs and optimization strategies. Furthermore, we conduct a comprehensive analysis and qualitative/quantitative comparison of the sketch designs. Finally, we highlight the open issues for future sketch-based network measurement research.

Index Terms—Sketch; Counter; Network Measurement; Traffic Measurement

I. INTRODUCTION

Measurement-based performance evaluation of network traffic is a fundamental prerequisite for network management, such as load balancing [1], routing, fairness [2], intrusion detection [3], caching [4], traffic engineering [5], performance diagnostics [6] and policy enforcement [7]. With the explosion of network users and network scale, the underlying network traffic becomes massive and unpredictable, which challenges and complicates all the network functions. Network measurement probes the underlying flows and provides the metadata to support network-related decisions and threat diagnosis. Efficient network measurement requires various levels of traffic statistics, both as an aggregate and on a per-flow basis [8]. These statistics include the number of distinct flows (flow cardinality), the number of packets of a flow (flow size), the traffic volume contributed by a specific flow (flow volume) and the time information. Together, these statistics profile the network and provide essential information for network management and protection/defense.

The term sketch refers to a kind of synopsis data structure that approximately records the frequency or estimates the cardinality of items in a multiset or stream. Basically, sketches hash each packet, according to the flow key of interest, into a union of cells or counters. Sketches offer an efficient method to record the existence of an active flow and its statistical information for later passive traffic measurement tasks. As a compact summary, sketches are typically much smaller than the size of the input and are adaptive to network measurement tasks. Moreover, sketches guarantee constant time overhead for element insertion and information query operations, with guaranteed accuracy.

The primary motivation for conducting this survey is twofold. Firstly, a lot of sketch designs and optimizations have been proposed to perform measurement tasks in recent years. Existing surveys, however, are somewhat out-of-date and do not cover these new proposals. The latest survey [13] was published four years ago and only covers a small portion of the work. Secondly, the existing surveys [12] [14] solely focus on one or several particular aspects of measurement tasks and lack a comprehensive view of sketch design and optimization. The earlier works [9] [10] covered fewer than 20 sketch variants and were published nearly ten years ago. These existing surveys did not provide comprehensive reviews of sketch-based network measurement. As shown in Table I, compared with the existing surveys, our survey covers more sketch variants and considers more measurement tasks.

TABLE I: Comparison with existing surveys

First author    # Variants   # Measurement tasks   Year
Cormode [9]     ≤ 15         3                     2010
Cormode [10]    ≤ 20         3                     2011
Zhou [11]       ≤ 10         4                     2014
Yassine [12]    —            —                     2015
Gibbons [13]    ≤ 10         1                     2016
Yan [14]        —            1                     2016
This survey     ≥ 90         13                    2021

[Fig. 1 here. The framework organizes the surveyed work into four parts: preparation for traffic measurement (monitor placement, traffic rerouting, flow distribution, sampling, batching); optimization in the sketch structure (hashing strategy, counter-level and sketch-level optimization, optimization for speed and space); optimization in the post-processing stage (sketch compression, merging, information extraction); and measurement tasks in the time, volume, and time-volume dimensions (persistence detection, flow latency, per-flow measurement, cardinality estimation, super spreader/DDoS, heavy hitters, hierarchical heavy hitters, top-k detection, entropy estimation, change detection, burst detection, item batch detection).]
Fig. 1: Framework of this survey

In particular, our survey concentrates on the design and optimization of sketches from the perspective of the dataflow in the sketch algorithm.

In this survey, we first classify the sketch designs and optimizations from the perspective of the data processing pipeline of the sketch algorithm. Specifically, from the data recording stage to the information extraction stage, the core components of a sketch data structure are the update strategy, the design of the data structure, the query strategy, and the enabled functionality. Thus, we survey the existing works along three dimensions, i.e., preparation for sketch-based network measurement, optimization in the sketch data structure, and optimization in the post-processing stage. Besides, we further provide a comprehensive survey of sketch-based measurement tasks and implementations. The comprehensive framework is illustrated in Fig. 1. For each measurement task, we compare the corresponding sketch designs quantitatively and qualitatively.

Following the above directions, the remainder of this paper is organized as follows. Section II gives a basic description of sketch-based network measurement. Section III details the related work on preparation for traffic measurement, including collaborative traffic measurement and packet preprocessing. Section IV focuses on the design and optimization of the sketch data structure, including the hashing strategy, counter-level optimization, and sketch-level optimization. Section V focuses on the process of extracting information from the sketch and gives a detailed summary of techniques for sketch compression, sketch merging, data supplement, and information extraction. In Section VI, we compare the sketch-based methods from the perspective of measurement tasks, in terms of both the time dimension and the volume dimension. Further, we conclude the related hardware and software implementations of sketch-based measurement in that section. Finally, we give the related open issues in Section VII.

II. SKETCH-BASED NETWORK MEASUREMENT

Network measurement is indispensable for understanding the performance of a network infrastructure and managing networks efficiently. There are two ways to measure network performance: active or passive techniques.

In active measurement, network flows are continuously measured by sending a series of probe packets over the network paths. With the help of such probes, network operators are able to measure one-way delay, round-trip time (RTT), average packet loss and connection bandwidth, and to adjust the forwarding policies in case of load changes. Active measurement estimates the network performance by tracking how the probe packets are treated in the network [15]. The accuracy is closely associated with the probe frequency in general. Active measurement offers different granularities of end-to-end QoS measurement and is flexible, since operators can measure exactly what they want. However, sending probes frequently may impose significant measurement overhead that may disturb critical traffic.

By contrast, passive measurement provides an efficient tool for charging, engineering, managing and securing the communication infrastructure [16]. Without probing packets, passive measurement monitors the traffic traversing the measurement points to collect statistical information. In particular, a passive measurement system is able to summarize flow-level or packet-level traffic, estimate the flow/packet-size distribution, evaluate the fine-grained volume of network traffic with different attributes for usage-based charging, and more. Besides, passive measurement can help identify abnormal traffic patterns to ease network management tasks, such as traffic engineering, load balancing, and intrusion detection. A flow key is often defined by five fields of the packet headers; for the IPv4 protocol, a flow identifier includes the Source IP address (32 bits), Destination IP address (32 bits), Source Port (16 bits), Destination Port (16 bits), and Protocol (8 bits), i.e., f = ⟨SrcIP, SrcPort, DstIP, DstPort, Protocol⟩. A flow f is defined as a set of packets that share the same flow key, which can be flexibly assigned according to high-level needs. Therefore, the number of flow identifiers can be as large as 2^104 [17].

Active measurement obtains the network state, such as bandwidth, by injecting probe packets into the network. However, the measurement packets will disturb the network, especially when measurement traffic is sent at high frequency. In contrast, passive measurement measures the network without creating or modifying any traffic. We concentrate on sketch-based passive network measurement in this survey.
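For concreteness, the 104-bit identifier above can be assembled by bit-packing the five fields; the snippet below is our own illustration (the helper name `pack_flow_key` is hypothetical, not from any surveyed system):

```python
import ipaddress

def pack_flow_key(src_ip, dst_ip, src_port, dst_port, proto):
    """Pack the IPv4 five-tuple into one 104-bit flow identifier:
    32 + 32 + 16 + 16 + 8 = 104 bits."""
    key = int(ipaddress.IPv4Address(src_ip))                # 32-bit SrcIP
    key = (key << 32) | int(ipaddress.IPv4Address(dst_ip))  # 32-bit DstIP
    key = (key << 16) | src_port                            # 16-bit SrcPort
    key = (key << 16) | dst_port                            # 16-bit DstPort
    key = (key << 8) | proto                                # 8-bit Protocol
    return key

# A TCP packet (protocol 6) from 10.0.0.1:1234 to 192.168.0.7:80
f = pack_flow_key("10.0.0.1", "192.168.0.7", 1234, 80, 6)
assert f.bit_length() <= 104
```

The packed integer is what a sketch's hash functions would consume; the 2^104-sized key space is exactly why exact per-flow bookkeeping is infeasible.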

Fig. 2: The CDF of flow size distribution. There are 424.6 million IPv4 packets in total from the WIDE traffic trace [18].

A. Challenges of passive measurement

The challenges of passive measurement mainly lie in two aspects, i.e., resource limitations and unpredictable traffic characteristics. Therefore, most researchers envision algorithms that can obtain the statistical information of interest about the traffic with limited resources.

Resource Limitation. When implementing a measurement task, a key challenge is how to perform it efficiently in a small, high-speed memory [19]. Nowadays, core routers forward most packets on the fast-forwarding path directly between network interface cards, without going through the CPU or main memory. To keep up with such high speed, measurement tasks are expected to run in SRAM. However, SRAM capacity is limited and is shared by all online network functions for routing, management, performance and security. Hence, the available space for measurement tasks is limited. To address this challenge, as stated in Sections VI-B1, VI-B2 and VI-B3, lots of proposals have been investigated to fit the measurement algorithms into SRAM, hybrid SRAM-DRAM, and DRAM. Moreover, following the development of hardware, researchers have tried to perform measurement tasks with FPGA and TCAM, as summarized in Sections VI-B4 and VI-B5. Furthermore, with the emergence of software-defined networking (SDN), efforts have been made to perform measurement tasks in P4 switches and Open vSwitch, as shown in Sections VI-B6, VI-B7 and VI-B8.

Flow Unpredictability. Real network traffic is always high-speed and non-uniformly distributed. Such characteristics further complicate measurement algorithm design. On the one hand, the high speed makes it difficult to record the size of traffic flows accurately. On the other hand, the skewness of flows in the network further aggravates the measurement task. Usually, the flow size/volume follows a Zipf [20] or power-law [21] distribution, as illustrated in Fig. 2, wherein the traffic trace was collected by the MAWI Working Group [18] at its samplepoint-G from 2:00 pm to 2:15 pm on 29/02/2020. Specifically, most flows are small in size, often known as mouse flows, while a few of them are extremely large, commonly known as elephant flows. For most measurement tasks, elephant flows are more important than the small ones. Therefore, each counter should be capable of counting the largest flows accurately. In practice, however, the network operators may not know the size of the elephant flows in advance, which makes it tricky to set the length of each counter.

The optimal measurement strategy for a specific measurement objective typically assumes a priori knowledge about the traffic characteristics. However, both traffic characteristics and measurement objectives can dynamically change over time, potentially rendering a previously optimal placement of monitors suboptimal. For instance, a flow of interest can avoid detection by not traversing the deployed monitors. The optimal monitor deployment for one measurement task might become suboptimal once the objective changes. We survey the work targeted at handling the skewness of flows in Section IV. Further, packet batching and sampling strategies, which are proposed to match the high line rate, are detailed in Subsection III-B. Note that, as illustrated in Fig. 5, the packets of a flow may traverse multiple paths from source to destination. To maximize the flow coverage and minimize the redundancy, the switches in the middle should take part in the measurement tasks. The related work is summarized in Subsection III-A.

B. Performance requirements for network measurement

Given the limitations and challenges of passive network measurement, we further elaborate the performance requirements of measurement tasks.

High Accuracy. Most of the existing measurement methods provide approximate results. They provide theoretical guarantees that the estimated result has a relative error ε with a confidence probability 1 − δ, where ε and δ are configurable parameters between 0 and 1 [22]. Novel updating strategies, which are designed to improve measurement accuracy, can be found in Subsection IV-B6. Several multi-layer sketch designs that improve measurement accuracy are detailed in Subsection IV-C1.
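For Count-Min-style sketches (Section II-D), ε and δ translate directly into memory layout: w = ⌈e/ε⌉ counters per row and d = ⌈ln(1/δ)⌉ rows. A quick illustration of the sizing arithmetic (the helper name is ours):

```python
import math

def cm_dimensions(epsilon: float, delta: float):
    """Dimensions for a Count-Min sketch so that a point query
    overestimates by at most epsilon * ||f||_1 with probability >= 1 - delta."""
    w = math.ceil(math.e / epsilon)        # counters per row
    d = math.ceil(math.log(1.0 / delta))   # rows, i.e., number of hash functions
    return w, d

w, d = cm_dimensions(epsilon=0.001, delta=0.01)
print(w, d)  # 2719 counters per row, 5 rows
```

Note how the memory grows as 1/ε but only logarithmically in 1/δ, which is why tight error bounds are far more expensive than tight confidence bounds.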
Memory Efficiency. Using a compact data structure to perform traffic measurement is a core requirement for both hardware and software implementations. Because of the vast number of flows passing through a network device, memory space becomes a significant concern. For a hardware implementation, high memory usage enlarges the chip footprint and heat dissipation, thereby increasing manufacturing costs [22]. For software switches, although servers have plenty of memory [23], high memory usage will deplete the memory of co-located applications (e.g., VMs or containers). The related work on improving space efficiency is detailed in Section IV-B. Furthermore, sampling strategies are also deployed to reduce the space overhead, as specified in Section III-B1.

Real-Time Response. Most measurement tasks should respond to traffic anomalies in time to avoid potentially catastrophic events. However, some sketch designs can support queries only after the entire measurement epoch. For example, before querying a certain flow, Counter Braids [24] must obtain the information of all flows in advance, and decoding has to be executed offline in a batched way. As a result, the query speed is significantly slowed down. Additionally, several works that focus on the most recent information are detailed in Section IV-C3.

Fig. 3: Sketch-based network traffic measurement
Fast Per-packet Processing. Processing packets at high speed by limiting the per-packet memory access overhead of updates is essential for network measurement. For example, a reduction in latency for partition/aggregate workloads can be achieved if we can identify traffic bursts in near-real time [6]. The main challenges for high processing speed stem from high line rates, the large scale of modern networks, and frequent memory accesses. A hardware implementation usually prefers running algorithms in TCAM or SRAM, rather than DRAM, for better high-speed processing. For a software implementation, as can be envisioned in upcoming SDN and NFV realizations, the ability to perform computations on time greatly depends on whether the data structures fit in the hardware cache and whether the relevant memory pages can be pinned to avoid swapping [8]. The related works can be found in Sections III-B, VI-B1 and VI-B3. The per-flow measurement methods are discussed in Section IV. The optimization strategies for fewer memory accesses are detailed in Section IV-B5.

Distributed Scalability. Due to the space limitation of a single switch, it is advisable to deploy a collaborative traffic measurement system wherein each switch is only responsible for a part of the underlying flows in the network. For more details on collaborative traffic measurement, the related works are introduced in Section III-A.

Generality. There are various network measurement tasks. Designing a specific algorithm for each different type of task is costly and inefficient. Moreover, it requires sophisticated resource allocation across different measurement tasks to provide accuracy guarantees [25] [26]. A general sketch-based measurement method is meant to support a variety of network measurement tasks simultaneously. Typical methods, which are general or partially general across traffic measurement tasks, are listed in Table V.

Reversibility. Although traditional sketch designs can answer the point query, they fail to return the corresponding flow identifier from the counters. Thus, many efforts have been made to make sketches invertible and capable of identifying the flows of interest from the sketches directly. The details are described in Section V-B1.

C. Sketch as a perfect choice

Sketch-based network measurement belongs to passive network measurement, which does not cause any overhead in the network since it does not send any probe packets. It allows the processing of local traffic states and of the global behavior of traffic flows passing through specific network partitions. For passive network measurement, monitors require full access to the network devices such as routers and switches, as shown in Fig. 3. Generally, sketch-based traffic measurement can be divided into the following steps: 1) the distributed measurement points acknowledge their tasks and collect the corresponding traffic statistical information passively, and 2) the collectors merge the measurement results from the subordinate monitors and respond to the measurement query.

As illustrated in Fig. 1, sketch-based network measurement follows the same pipeline as stream processing. It is concerned with processing a long stream of traffic packets in one pass, using a small working memory, to estimate specific statistics of flows/packets approximately. To enable numerous per-flow counters in a scalable way, replacing exact counts with estimated counters is an intrinsic feature of sketch-based measurement. These estimates reach an acceptable trade-off between memory occupation and estimation accuracy. A sketch offers an efficient method to record the existence of active flows and their information in passive network measurement with accuracy guarantees. As a compact summary, sketches are typically much smaller than the size of the input.

A sketch organizes its working memory as a synopsis data structure, specializing in capturing as much information about the target statistics as possible. Within the vast flow key space, we are interested in the identities of the active flows (during the measurement epoch). To record the massive set of active flows, a brute-force method is to count and record them directly. However, such a plain design will incur high memory access and computing overhead. By contrast, a sketch-based method usually maps the flow keys into a table with multiple buckets/cells within constant time. The corresponding buckets/cells store the information of interest about the flows for later queries, also within constant time. Various sketches have been proposed for different statistics about the flows. More generally, there are two primary operations in a sketch-based algorithm: update and query. The flows in a measurement epoch are represented by the sketch via update operations and are thereafter queried for upper-level network analysis.
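The update/query pattern, together with the merge operation used for distributed aggregation, can be written down as a minimal interface. This is our own illustrative abstraction, not a structure from any surveyed work; concrete designs such as the CM sketch (Section II-D) instantiate it:

```python
from abc import ABC, abstractmethod

class Sketch(ABC):
    """Abstract view of a sketch: packets update it during the
    measurement epoch; statistics are queried afterwards."""

    @abstractmethod
    def update(self, flow_key: int, value: int = 1) -> None:
        """Record `value` (e.g., a packet count or byte count) for a flow."""

    @abstractmethod
    def query(self, flow_key: int) -> int:
        """Estimate the recorded statistic for a flow."""

    @abstractmethod
    def merge(self, other: "Sketch") -> None:
        """Combine with a sketch from another measurement point,
        enabling distributed, network-wide measurement."""
```

Different proposals vary in how `update` and `query` are realized and in whether `merge` (and flow-key reversibility) is supported at all.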
Fig. 4: The CM sketch structure
Fig. 5: Example topology with multiple routing paths

Note that different sketches may provide distinct update and query strategies. Moreover, most kinds of sketches enable distributed measurement scalability via merging operations. Some sketches embed additional information into the cells to enable reversibility.

D. Count-Min sketch as an example

As the most widely applied sketch, the Count-Min (CM) sketch [27] has been used to perform per-flow measurement, heavy hitter detection, entropy estimation [28], inner-product estimation [29], privacy preservation [30], clustering over a high-dimensional space [29] and personalized PageRank [31].

The CM sketch can represent high-dimensional data (like the 5-tuple flow key space) and answer point queries with strong accuracy guarantees within constant time. With such a design, the CM sketch supports various queries, such as heavy hitters, flow size distribution, top-k detection, and beyond. Since the data structure can efficiently process updates in the form of insertions or deletions, it is capable of working on traffic streams at high rates. The data structure maintains a linear projection of the data with several random vectors, which are defined implicitly by simple hash functions. Increasing the range of the hash functions or the number of hash functions improves the accuracy of the summary [32]. Because of this linearity, CM sketches can be scaled, added, and subtracted to produce summaries of the correspondingly scaled network traffic. However, due to the irreversibility of hashing, the CM sketch cannot support flow key reversibility.

Specifically, the CM sketch hashes an arbitrary flow into each row of a counter array and leverages the corresponding counters to record the multiplicity or size information. It consists of d arrays, A_1...A_d, each of which contains w counters, as shown in Fig. 4. When measuring the network traffic, each flow key is extracted from the packet header and hashed by d independent hash functions, h_1...h_d. Thereafter, the d corresponding counters, i.e., A_k[h_k(f) % w] for all 1 ≤ k ≤ d, are increased by the flow sizes. Because computing each hash function takes O(1) (constant) time, the total time to perform an update is O(d), independent of w. To query the size of a flow f, CM returns the minimum among the d counters as the estimate. Moreover, CM only incurs over-estimation errors, with the following accuracy guarantee.

Theorem 1 [27]: If w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉, the estimate ŝ_i has the following guarantees: s_i ≤ ŝ_i, and with probability at least 1 − δ,

    ŝ_i ≤ s_i + ε‖f‖_1,    (1)

where s_i is the real size of the i-th flow in the flow set f; ŝ_i represents the point-query result for the i-th flow from the CM sketch; and e is the base of the natural logarithm, a constant chosen to optimize the space overhead for a given accuracy requirement.
III. PREPARATION FOR NETWORK MEASUREMENT

In practice, the measurement module is integrated as software inside a switch or deployed as standalone measurement hardware that taps into a communication link. As shown in Fig. 3, once a measurement module is placed or deployed on a link, it can capture packets carried by the link according to its configuration. As illustrated in Fig. 5, each flow may traverse multiple switches on its routing path, and any of them can perform measurement tasks. Moreover, introducing a measurement module on a link incurs additional deployment cost, including hardware/software investment, space cost and maintenance cost [33]. There is no need for one switch to measure all passing traffic. Therefore, the network-wide measurement functions should be installed as efficiently as possible, and preparing the measurement environment is necessary before measuring the flows. Besides, the actual measurement operations performed by the monitors also factor into the operating cost. Because the per-packet operation cost of each monitor depends mainly on the line speed, it is nontrivial to perform network measurement for high-speed traffic within limited resources. It is necessary to pre-process traffic packets whenever possible to reduce the computation and storage overhead.

A. Preparation for the measurement environment

1) Monitor Placement: In collaborative traffic measurement scenarios, it is urgent to minimize the overhead of hardware, software, and maintenance while guaranteeing measurement performance. To ensure network-wide coverage, previous works focus on the optimal deployment of monitors across the network, with the target of maximizing the measurement utility, such as the coverage of measured traffic and the efficiency of measurement functions, and of minimizing the cost of measurement.

TABLE II: Collaborative traffic measurement

Category            Methods                        Core idea
Monitor Placement   Suh et al. [33]                Minimize cost and maximize coverage
                    Sharma and Byers [34]          Minimize duplication of effort
                    Cantieni et al. [35]           Decide which switches should be activated
                    OpenTM [36]                    Choose which switches to query
Traffic Rerouting   MMPR [37], MeasuRouting [38]   Make traffic adaptive to the network and requirements
Flow Distribution   OpenWatch [39]                 Compute the optimal balanced task assignment
                    LEISURE [40]                   Consider load-balancing under different objectives
                    cSamp [41]                     Specify hash ranges per OD-pair per router
                    cSamp-T [42]                   Without relying on OD-identifiers
                    DECOR [43]                     From local to global optimization
                    DCM [44]                       Leverage the BF [45] to allocate responsibility
                    Xu et al. [46]                 Leverage hash ranges to allocate responsibility

The optimal strategies take the form of traffic routing paths or of optimal measurement point allocations in the whole measurement-enabled network.

For the situation in which switches are not yet equipped with measurement functionality, Suh et al. [33] consider the minimum-cost and maximum-traffic-coverage problems under various budget constraints. They derive a monitor placement strategy and specify the sampling rate of each monitor in order to maximize the fraction of sampled IP flows. They first find the links that should be measured and then determine the sampling rates with an optimization algorithm. Besides, Minimum Edge Cost Flow (MECF) [47] studies the problem of assigning tap devices for passive measurement. They consider three main problems: the first maximizes the volume of captured traffic under cost constraints; the second minimizes the deployment cost needed to achieve a measurement objective; and the last minimizes both installation and operational cost under the same objective. The authors present a combinatorial view of the problem, derive complexity and approximability results, and give an efficient and versatile Mixed Integer Programming (MIP) formulation. For passive measurement, they study the problem of sampling packets and then present an efficient way of placing monitor devices and of controlling their sampling rates. They show that all these problems are NP-complete and present heuristics to approximate the optimal solution for each one. They evaluate the performance of the proposed algorithms by simulation, on topologies discovered by the Rocketfuel utility and with generated traffic matrices. In addition, this work only considers the static placement of monitors.

In practice, current routers deployed in operational networks are already equipped with measurement modules (e.g., NetFlow [48], OpenFlow [49]). There is no need to turn on all these measurement functionalities, because of their expensive operational cost and measurement redundancy, and because there are potentially hundreds of candidate measurement points to choose from for network-wide measurements. Cantieni et al. [35] focus on which monitors should be activated and what sampling rate should be set on these monitors in order to achieve a given measurement task with high accuracy and low resource occupation. Sharma and Byers [34] propose space-efficient data structures for gossip-based protocols to approximately summarize the sets of measured flows. With some fine-tuning of the methods, they ensure that all flows are observed by at least one monitor and that only a tiny fraction of flows are measured redundantly. OpenTM [36] explores several algorithms for choosing which switches to query. They further reveal that there is a trade-off between the accuracy of measurements and the worst-case maximum load on individual switches. The work shows that a non-uniform query strategy, which tends to query switches closer to the destination with higher probability, performs better than uniform schemes.

The measurement point placement problem thus covers optimal placement strategies when switches lack measurement functionality [33] [47] and the measurement activation problem when they already have it [35] [36]. As the core of collaborative measurement, these works are essential for setting up the environment for sketch-based measurement.

2) Traffic Rerouting: The above works focus on the optimal placement of measurement points across the network. However, both traffic characteristics and measurement tasks can dynamically change over time, so that a previously optimal placement of measurement points may be suboptimal for the current network state. When it is not feasible to adaptively redeploy/reconfigure measurement infrastructures to cater to such changing requirements [38], a possible strategy is to reroute the traffic across the monitoring points to obtain better measurement performance.

MeasuRouting [38] addresses the problem in which different measurement points have different measurement abilities for certain kinds of traffic, by rerouting the traffic of interest across measurement points. Building upon MeasuRouting, Measurement-aware Monitor Placement and Routing (MMPR) [37] considers that not only is the number of deployed monitors limited, but the traffic characteristics and measurement objectives also change continually and simultaneously. The MMPR framework jointly optimizes monitor placement, dynamically turning monitors on and off, and the dynamic routing strategy to achieve maximum measurement utility, quantifying how well each flow is monitored. MMPR formulates the problem as a MILP (Mixed Integer Linear Programming) problem and proposes several heuristic algorithms to approximate the optimal solution and reduce the computational complexity. Considering the dynamics involved in estimating flow importance and configuring the routing table entries of both MeasuRouting and MMPR, Dynamic Measurement-aware Routing (DMR) [50] summarizes these challenges and proposes solutions.

3) Flow Distribution/Allocation: A natural problem of collaborative measurement is how to assign the flows to the measurement nodes/switches so as to ensure both load balance and measurement completeness. Such a problem is formulated as a flow distribution/allocation problem. By allocating each flow to a sequence of switches, every switch only measures a subset of flows, saving memory and computation resources. Hitherto, a dozen works have tried to tackle this question. DCM [44], OpenWatch [39] and LEISURE [40] focus on measurement load-balancing strategies, while the hash-based works [41] [42] [46] aim at lightweight implementations of such strategies.

The Distributed and Collaborative Monitoring system (DCM) [44] leverages Bloom filters to record which flows should be measured by each switch. It requires each switch to store two Bloom filters, with the first one encoding the set of flows to be measured locally and the second one helping remove false positives from the first filter. However, the memory overhead is significant, because a Bloom filter takes 10 bits per flow on average to ensure a false-positive ratio of less than 1%. Moreover, the lookup of each Bloom filter takes k hash operations and k memory accesses, where k is 7 for an optimal Bloom filter with a 1% false-positive ratio. OpenWatch [39] designs an efficient method to offload the measurement load from the ingress switches to other switches along the routing paths. It leverages the global view of the network topology and the routing path of each flow to compute the optimal task assignment for all the switches in the network. By contrast, Load-EqualIzed meaSUREment (LEISURE) [40] balances the network measurement load across distributed monitors.

DECOR [43] divides the network-wide optimization into smaller pieces called optimization units. It thus tries to achieve local optimization in each optimization unit and then extends it to global optimization. The DECOR framework can be applied to cSamp, as cSamp is also a resource assignment strategy; comprehensive experiments conclude that DECOR-based cSamp is superior to the alternatives. Xu et al. [46] propose a new lightweight solution to the flow distribution problem for collaborative traffic measurement in SDN. The proposed framework reduces the memory/space overhead on each switch to a single sampling probability value p. The processing overhead is at most one hash operation per packet to implement sampling if the packet is not recorded by one of the other switches on the routing path; if the packet was recorded earlier, the hash operation is not performed. Thus, the controller can decide the optimal Probability Assignment for Ingress Switches (PAIS) and Network-wide Switch Probability Assignment (NSPA) for collaborative measurement scenarios. Compared with [39]–[44], the method proposed in [46] is, up to now, the most lightweight strategy for the flow distribution problem. By allocating different hash ranges to the series of switches on a path, NSPA elegantly distributes responsibility among the switches along the link. In this way, the additional processing overhead is at most one hash operation per packet to implement sampling, and the additional space overhead on each switch is only a single sampling probability value p.
By allocating different hash range to a [44] leverages Bloom filters to record which flows should be series of switch, NSPA elegantly allocates the responsibility measured by the switch. It requires each switch to store two to the switch across the link. In this way, the additional Bloom filters, with the first one encoding the set of flows processing overhead is at most one hash operation per packet to be measured locally and the second one helping remove to implement sampling, and the additional space overhead on false positives from the first filter. However, the memory each switch is only a single sampling probability value p. overhead is significant because it takes a Bloom filter 10 bits per flow on average to ensure a false-positive ratio of B. Preparation for measurement data less than 1%. Moreover, the lookup of each Bloom filter Intuitively, after deploying a measurement environment, takes k hash operations and k memory accesses, where k measurement points can directly perform measurement tasks. is 7 for an optimal Bloom filter with 1% false positive However, in the high-speed network, performing measurement ratio. OpenWatch [39] designs an efficient method to offload on each packet is a heavy overhead task. Sampling, batch the measurement loads from the ingress switches to other processing, and randomization are powerful tools to reduce switches along with the routing paths. It leverages the global the processing overhead in various systems [51]. view of network topology and routing path for each flow 1) Traffic Sampling: to compute the optimal task assignment for all the switches Production network measurement tools, such as NetFlow in the network. By contrast, Load-EqualIzed meaSUREment [48] and sFlow [52], always sample the network flows for (LEISURE) [40] balances the network measurement across resource saving and fast processing. However, they incur low distributed monitors. 
Specifically, it considers various load- estimation accuracy since many flows are missed. Given a balancing problems under different objectives and studies their sampling probability of p, flow sampling means that every extensions to support different deployment scenarios. monitor processes one packet per 1/p packets in each flow. Several works leverage hash functions to tradeoff com- while packet sampling indicates that every monitor processes putation and space overhead when allocating flows across one packet per 1/p packets of all traffic. The linear sampling route paths. cSamp [41] is a centralized hash-based packet strategies can both miss flows and sample the same packets for selection system, which allows distributed monitors to mea- multiple times along the routing path. Some non-linear sam- sure disjoint sets of traffic without explicit communications. pling sketches have been proposed to overcome the drawback cSAMP specifies the set of flows identified by OD-pairs of uniform random sampling strategy. (origin-destination) that each measurement point is required To realize the constant relative error, small values of flow to record by considering a hybrid measurement objective size are incremented with high probability while large ones that maximizes the total flow-coverage subject. An improved with low probability. The authors in [53] propose to set the method cSamp-T [42] provides comparable measurement ca- packet sampling rate as a decreasing function of the flow sizes. pabilities to cSamp without relying on OD-pair identifiers. This strategy can significantly increase the packet sampling Each router only uses only local information, such as packet rate of small and medium flows by sampling less large flows. headers and local routing tables, rather than the global OD- By doing so, more accurate estimations of various network pair identifiers. DECOR [43] provides a solution to coordinate statistics can be guaranteed. 
However, the exact sizes of all network resources and avoids controller bottleneck, message flows are available only if we keep per-flow information for delay, and a single point of failure. DECOR divides a network every flow, which is prohibitively expensive for highspeed links. Sketch guided sampling (SGS) [53] solves the estimation a b c d a a c b a a 10 updates problem by using a synopsis data structure called counting Sketch sketch to estimate the approximate sizes of all flows. Adaptive Agg a 5 b 2 c 2 d 1 4 updates Random Sampling [54] bounds the sampling error (relative estimation error) within a pre-specified tolerance level and Fig. 6: Agg-Evict strategy proposes an adaptive random sampling technique to adjust the sampling probability and minimize the number of samples. Adaptive Non-Linear Sampling (ANLS) [55] updates the Generally speaking, sampling is the optimal strategy to counters of sketch with sampling probability function p(c), reduce the per-flow measurement. From the uniform sam- where c is the counter value and p(c) is a pre-defined function. pling, adaptive sampling [53] [54] [56] [55] [16] to separated In this work, the authors provide the general principles to estimated-bucket strategy [57], the estimation accuracy keep guide the selection of sampling function p(c) for sampling increasing. Although sampling still suffers from minor esti- probability adjustment. The intuition of ANLS is to implement mation errors, these strategies have played an essential role in a large sampling rate for small flows while a small sampling sketch-based network measurement. rate for large flows. The authors also derive the unbiased 2) Randomization Processing: flow size estimation, the bound of the relative error, and the Different from packet sampling, randomization processing bound of the required counter size for ANLS. It updates a is an alternative approach to reduce the packet processing counter from c to c+1 with probability p(c). 
Self-tuning ANLS overhead in which only the sampled counters are updated. [56] selects one specific sampling function according to the Randomized HHH (Hierarchical Heavy Hitters) [58] pro- principles ANLS [55] proposed. Set the p(c) as follow: poses a randomized constant-time algorithm for HHH. Un- like the deterministic algorithms whose update complexity 1 p (c) = c (2) is proportional to the hierarchy’s size H, RHHH randomly (1+a) −1 samples only a single prefix to update using its respective where a is a pre-defined parameter 0 < a < 1. This paper instance of heavy-hitters rather than updating all prefixes for focus on the method which adjusts the parameter a during each incoming packet. This randomization strategy decreases the statistic process in order to provide the best and ideal the update time from O(H) to O(1), but requires a specific tradeoff between accuracy and counting range. DISCO [16] number of packets to provide desired accuracy guarantees. further extends ANLS by regulating the counter value to be a NitroSketch [59] is a careful synthesis of rigorous yet real increasing concave function of the actual flow length to practical sketch design to reduce the number of per-packet support the counting of flow size, flow volume counting, and CPU and memory operations. Like the mechanism used in flow byte count simultaneously. Randomized HHH [58], NitroSketch only samples a small However, in order to accurately estimate the largest counter, portion of packets by geometric sampling, and the sampled the above methods compromise the accuracy of the smaller packets need to go through one hash computation, update to flows [57], where the sampling probability function can be one row of counters, and occasionally to a top-k structure. scaled to achieve a higher counting range at the cost of a The experiments demonstrate that the accuracy is comparable larger estimation error. 
Independent Counter Estimation Buck- to unmodified sketches while attaining up to two orders of ets (ICE-Buckets) [57] first presents a closed-form explicit magnitude speedup and a 45% reduction in CPU usage. representation of an optimal estimation function, which is used To catch up with the line rate and reduce the processing to determine these probabilities and estimate the real value of overhead, both RHHH [58] and NitroSketch [59] takes the a counter. ICE Buckets separate the flows into buckets and randomization strategy to update the sampled counters in their configure the optimal estimation function according to each sketches. Although they all reduce the updating overhead, bucket’s counting range, thereby significantly reducing the the corresponding query is complicated to cope with the overall error by efficiently utilizing multiple counter scales. incomplete recorded information, and a minimal volume of Both linear and nonlinear sampling may incur the problem packets is required to achieve the desired accuracy guarantee. that some flows are over-estimated (sampled more than the 3) Batch Processing: sampling probability), while some are under-estimated (sam- Due to the high speed of traffic packets and the per-flow pled less than the sampling probability). Starting with a simple processing requirement, the operational overhead is rather idea that “independent per-flow packet sampling provides the heavy. In contrast to the sampling strategy, the inspiration of most accurate estimation of each flows”, SketchFlow [51] batching is to aggregate the packets for the same flow and performs an approximated systematic sampling for flows to then update them together as a whole to reduce the overall provide a better tradeoff between accuracy and overhead for computation overhead, as illustrated in Fig. 6. a given sampling rate 1/p. 
This strategy is achieved by Randomized DRAM-based counter scheme [60] takes a new recognizing a sketch saturation event for a flow and only counter update by looking up the cache to see if there is sample the saturated packets. However, SketchFlow is only a already a waiting update request to the same counter. The general sampler strategy and can not be deployed to measure scheme modifies that request in the cache rather than creating the traffic individually. a new request (e.g., change the request from +1 to +n). Agg-Evict [61], a typical network measurement tool, also 1) Conventional hash-based data structures: employs the same idea to leverage the benefits of caching. It The existing sketches are mainly based on traditional constructs a data structure, which improves the efficiency of hash functions to maintain the mapping from flow keys to network traffic measurement in software and proposes a low- cells/buckets in the data structures. level mechanism to improve the efficiency of various schemes. Hash-based frequency estimation. We regard CM Sketch Skimmed Sketch [62] and Augmented Sketch (ASketch) [63] [27], BF [45] and Cuckoo filter [65] as the hash base for focus on aggregating the most frequent items in the filter frequency estimation. CM sketch [27] consists of d arrays, and evict the cold items into the second stage, while Cold A1, ..., Ad, each of which contains w counters. It hashes filter [64] captures the cold items in the first filter and evicts incoming packets into each row of the hash table and locates the hot items to the second stage. ASketch and Skimmed one counter in each row to update the multiplicity or size Sketch need two-direction communication between two stages information. Bloom filter [45] leverages the k bits in a vector to exchange elements because of the difficulty to capture hot to represent the membership of an element within set. Despite items accurately. 
However, Cold filter only needs one-direction the constant-time complexity and the space-efficiency features, communication and less memory access. Bloom filter cannot support the deletion and record the multi- The core idea in batch processing [60] [61] [63] [64] is to plicity of flows. To this end, Counting (CBF) [66] take the incoming packets for a specific flow into a batch and replaces each bit in the vector with a counter of multiple bits. then update them as a whole into the synopses data structure. Whenever a flow is mapped into cells, the k counters will be In this way, the memory access to the sketch is significantly incremented by the flow size. Then the deletion of elements is reduced. However, this strategy needs to augment an additional straightforwardly realized by decreasing the k corresponding cache to maintain the information of the previous packets’ counters. Cuckoo filter (CF) is a hash table with b buckets, and flow. each bucket has w slots to accommodate at most w elements. Fan et al. [65] further represents the elements by recording C. Summary and lessons learned their fingerprints in the cuckoo filter. They further apply the In this section, we cover the aspect of preparation before partial-key strategy to determine the candidate buckets during measurement, including measurement environment and mea- the insertion phase and locate the candidate bucket [67] during surement data preparation. By preparing the environment, each the reallocation phase. However, CF can not support constant- monitor is optimally placed or activated in the corresponding time insertion because of the unbalanced probe during its measurement point to get the best measurement performance reallocation phase. considering measurement cost. And all of them are allocated Instead of maintaining mapping from flow key to flow the responsibility of measurement tasks. 
By data preparation, counters directly, a novel strategy chooses to maintain mapping each measurement point performs the strategies on the raw from flow key to several bit/bits counters according to bit- traffic stream to reduce the measurement overhead. Note representation of flow key. Group Testing [68] arranges a that both sampling and randomization have to balance the number of items into a group together to find out active tradeoff between accuracy and processing overhead. After the items. It firstly decides which group the item belongs to preparation phase, the sketch algorithms can be installed and and then updates these group counters as follows. The first performed in each monitor point. counter records the total number of items and the following counters are increased according to bits representation of the IV. OPTIMIZATIONINSKETCHSTRUCTURE item. And the majority item can be identified by traversing the total counters. Fast sketch [69] takes the same strategy The sketches are responsible for recording the active flows as Group Testing [68]. It hashes each incoming packet into and the corresponding size or volume information. The flow several predefined numbers of rows and adds its size to the information is recorded in the sketch for later information first counter in each row. After that, the corresponding counters retrieval and extraction in the updating stage. As shown in also adds the flow size according to the bit representation of Fig. 7, a lot of strategies have optimized sketches in terms the quotient of the flow. By collecting and adding sketches of hashing strategy, counter level optimization and sketch from multiple switches, the first counter in each row is firstly level optimization. In this section, we will review these work checked to recover the heavy-change flows. SketchLearn [22] in detail with the consideration of performance optimization. 
also takes the bit-mapping strategy to extract large flows from the multi-level sketch and leaves the residual counters for A. Hashing strategy small flows to form Gaussian distributions. XY-Sketch [70] A sketch algorithm needs a feasible amount of space to store investigates the decomposition-and-recomposition framework the counters and a flow-to-counter association rule so that ar- to transform the problem of item frequency estimation into the riving packets can update corresponding counters at line speed. problem of probability estimation. Specifically, it decomposes This subsection gives a detailed summary of hashing strategies, each item into a sequence of sub-items and hashes them into including traditional and learn-based hash strategies. Almost corresponding rows of the sketch. When responding to a point all existing sketch-based network measurements follow these query, XY-Sketch recomposes the element and returns the different flow-to-counter rule designs as the basis. product of all the sub-items probabilities appearing in the data Fast sketch[69] OPA [182] CBF[66] FlowRadar[127] Spread Sketch [189] LDA[124] Univmon[140] SketchVisor[168] Defeat [129]
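As an illustration of the flow-to-counter rule, a minimal Count-Min sketch can be written as follows. This is a hedged sketch, not the implementation of [27]: the row count d, width w, and seeded hashing via Python's built-in `hash` are illustrative assumptions.

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: d rows of w counters, one hashed
    counter per row (illustrative parameters, not the original code)."""

    def __init__(self, d=4, w=1024, seed=42):
        rng = random.Random(seed)
        self.d, self.w = d, w
        # One independent hash seed per row.
        self.seeds = [rng.randrange(1 << 30) for _ in range(d)]
        self.counters = [[0] * w for _ in range(d)]

    def _index(self, row, key):
        # Select one counter in the given row for this flow key.
        return hash((self.seeds[row], key)) % self.w

    def update(self, key, count=1):
        for i in range(self.d):
            self.counters[i][self._index(i, key)] += count

    def query(self, key):
        # Collisions only inflate counters, so the minimum over the
        # d rows upper-bounds (never underestimates) the true frequency.
        return min(self.counters[i][self._index(i, key)]
                   for i in range(self.d))
```

A point query such as `cms.query("a")` therefore returns at least the true packet count of flow "a", with overestimation bounded by hash collisions in the narrowest row.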
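The counting Bloom filter described earlier, which trades one counter per cell for deletion support, can be sketched in a few lines. This is an illustrative toy under assumed parameters (m cells, k hash functions), not the CBF implementation of [66]:

```python
class CountingBloomFilter:
    """Counting Bloom filter: each of the m cells holds a counter
    rather than a bit, so elements can be deleted by decrementing
    (illustrative sketch; m, k and the hashing are assumptions)."""

    def __init__(self, m=1024, k=7, seed=7):
        self.m, self.k, self.seed = m, k, seed
        self.cells = [0] * m

    def _indices(self, item):
        # k cell positions derived from k salted hashes of the item.
        return [hash((self.seed, i, item)) % self.m for i in range(self.k)]

    def insert(self, item, count=1):
        for idx in self._indices(item):
            self.cells[idx] += count

    def delete(self, item, count=1):
        for idx in self._indices(item):
            self.cells[idx] -= count

    def contains(self, item):
        # May report false positives, but never false negatives.
        return all(self.cells[idx] > 0 for idx in self._indices(item))
```

Deleting an item simply reverses its k increments, which is exactly the property the plain bit-vector Bloom filter lacks.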
Fig. 7: The taxonomy of the existing sketch methods. On the one hand, the sketches are classified in three main aspects from the structure optimization perspective, i.e., hashing strategy, counter level optimization and sketch level optimization. On the other hand, to improve the performance, dozens of variants devote themselves to improving space efficiency, increasing measurement accuracy, incurring fewer memory accesses, and enriching the functionalities.

Hash-based cardinality estimation. To count the number of distinct elements in a flow stream, different hash strategies have been proposed. Bitmap [71] maps the whole stream uniformly onto a bit array. By encoding each element as an index of the bit array, replicated elements are filtered automatically. However, the space of this strategy is linear in the cardinality [72]. Another strategy is to prepare an array of sample buckets with an exponentially decreasing sampling probability; each bucket is allocated a bit to record whether the bucket receives any stream elements. By extracting the information in this bit array, PCSA [73], LogLog [74], HyperLogLog [75], HLL-TailCut+ [76] and Sliding HyperLogLog [77] all take this underlying data structure as the base to estimate the cardinality.

Recently, Zhou et al. [78] proposed a generalized sketch framework, including bSketch (based on the structure underlying the counting Bloom filter), cSketch (based on the structure of CountMin) and vSketch (based on the memory sharing mechanism), which aims to incorporate the sketch designs under a general implementation structure, with plug-n-play flexibility and many options for the tradeoff. Besides, to reverse flow keys from sketches, a lot of works separately hash a single bit or a partition of the flow key, instead of the whole flow key, to record the statistical information of that bit or bits; the details can be referred to in Section V-B1. As the basis of the sketch algorithm, the flow-to-counter rule is the first step to obtain the index of the corresponding counters inside the sketch.

2) Learning-based mapping:
Apart from the traditional hash strategies, a new trend of combining learning-based methods, such as k-means clustering and neural networks, with the hash table is emerging to reduce the space overhead and improve estimation accuracy. To compute the optimal mapping scheme, Bertsimas et al. [79] propose a mixed-integer linear optimization formulation and an efficient block coordinate descent algorithm to compute the near-optimal hashing scheme for the seen flows; a multi-class classifier is responsible for mapping elements to counters based on their features for the unseen flows. Fu et al. [80] put forward a class of locality-sensitive sketches (LSS) by formulating a theoretical equivalence relationship between the sketching error and the approximation error of k-means clustering. Composed of a cuckoo table as a pre-filter to test whether a flow has been seen and a cluster model that assigns each flow to the closest cluster center, LSS can efficiently mitigate the error variance and optimize the estimation.
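The bitmap-based cardinality estimators above can be illustrated with linear counting, where duplicates hash to the same bit and the fraction of zero bits yields the estimate. This is a hedged sketch under assumed parameters (array size m and the hash are illustrative), not the implementation of [71] or [92]:

```python
import math

def linear_counting_estimate(stream, m=4096, seed=3):
    """Bitmap / linear-counting cardinality estimate: hash every element
    to one of m bits, then estimate n ~= -m * ln(V/m), where V is the
    number of zero bits. Requires m comfortably above the cardinality
    so that some bits stay zero (illustrative parameters)."""
    bits = [0] * m
    for x in stream:
        bits[hash((seed, x)) % m] = 1  # duplicates set the same bit
    zero_fraction = bits.count(0) / m
    return m * math.log(1.0 / zero_fraction)  # equals -m * ln(V/m)
```

Because replicated elements land on an already-set bit, the estimate depends only on the number of distinct elements, not on how often each repeats.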
Hsu et al. [81] combine the sketch with a neural network trained to predict the logarithm of each flow's packet count. The heavy flows are recorded separately from the non-heavy ones by this learned oracle, to reduce the interference between them. Theoretical analysis proves that the error of the learned CM sketch is up to a logarithmic factor smaller than that of its non-learning counterpart. Based on [81], Aamand et al. [82] further provide a simple tight analysis of the expected error incurred by the CM sketch and the first error bounds for both the standard and learned versions of CSketch [83]. By such designs, the learning-based sketches show a new trend for measurement, and we believe there will be a growing body of such work in the future. Although learning-based mapping incurs less space overhead, the implementation of such designs requires substantial computation resources, which may not be advisable for computation-scarce situations. Besides, updating an out-of-date learning model is also non-trivial.
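The learned-oracle idea of [81] can be illustrated as follows. The `is_heavy` predicate stands in for a trained model, and the backing sketch is any structure exposing update/query; both names and the wrapper itself are hypothetical, a sketch of the technique rather than the authors' code:

```python
class LearnedSketch:
    """Learning-augmented frequency estimation in the spirit of [81]:
    flows the oracle predicts to be heavy get exact per-flow counters,
    while the remaining flows share the approximate sketch, so heavy
    flows neither suffer from nor cause hash collisions.
    (Illustrative wrapper; `is_heavy` stands in for a trained model.)"""

    def __init__(self, backing_sketch, is_heavy):
        self.sketch = backing_sketch   # must expose update(key, c) / query(key)
        self.is_heavy = is_heavy       # oracle: key -> bool
        self.exact = {}                # exact counters for predicted-heavy flows

    def update(self, key, count=1):
        if self.is_heavy(key):
            self.exact[key] = self.exact.get(key, 0) + count
        else:
            self.sketch.update(key, count)

    def query(self, key):
        if key in self.exact or self.is_heavy(key):
            return self.exact.get(key, 0)
        return self.sketch.query(key)
```

With a perfect oracle, the few elephant flows are counted exactly, and the sketch's limited counters are spent only on the long tail of mouse flows.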
Counter level optimization combining learning-based methods, such as k-means cluster- Given the limited on-chip memory, if the counters in the ing and neural network, with the hash table is emerging to sketch use many bits, the number of counters will be small, reduce the space overhead and improve estimation accuracy. leading to poor estimation accuracy. In this case, most counters To compute the optimal mapping scheme, Bertsimas et al. only record mouse flows; hence, those untapped bits are [79] propose a mixed-integer linear optimization formulation, wasted. By contrast, if the counters occupy only a few bits, A mode

0 0 … 0 … 0 Estimator 1 Estimator 2 ...... sign part splitA digit countingmode part Estimator 1 Estimator 2 ...... 1 1 0 …0 0…1/0 0 … …1/0 0 Register array

sign part (a)split SAdigit counting part Register array

sign part:21 1 counting… 0 part:311/0 … 1/0 (a) Register sharing [89] increase the counting part 0 1 0 1 1 1 1 1 by 1 with probability 1/4 Estimator 1 Estimator 2 sign part:+1 counting part:0 ...... A mode 0 1 sign1 part:20 0 counting0 0 part:310 set the counting part to 0 Estimator 1 Estimator 2 increase the counting part 0 1 0 1 1 1 1 1 Bits pool ...... (b)0 When0 insert… an item0 (e,…1) into0 SAby 1 with probability 1/4 (b) Bits sharing [90] sign part:+1 counting part:0 split digit counting part Bits pool 0sign 1part1 0 0 0 0 0 set the counting part to 0 Fig. 9: Virtual register design 1 1 … 0 1/0 … 1/0

(c) Dynamic SA counter is a more generic technique to adapt to different Fig. 8: Self-adaptive counter [84] and small active counter [85] counting ranges. Let n denote the number of bits in a sign part:2 counting part:31 counter. For each counter, it has s-bits for sign and (n−s)- increase the counting part 0 1 0 1 1 1 1 1 by 1 with probability 1/4 bits for counting. Moreover, SA augments an expansion array, γ [0] , γ [1] , ··· , γ [k−1], where k=2s. When updating a SA the sketch fails tosign represent part:+1 elephantcounting flows.part:0 Thus, many efforts have been made to address this dilemma. counter with value 1, SA first gets the sign part s0, and then 0 1 1 0 0 0 0 0 set the counting part to 0 1 adds 1 to the counting part c0 with probability , as Fig. 1) Small counter for larger range: γ[s0] n−s It is space-efficient to use a small counter to keep approxi- 8(b) illustrated. If the counting part reaches 2 , the sign part mate counts of large numbers, as the resulting expected error will be increased by one, and the counting part will be set to can be rather precisely controlled. Firstly proposed in Approx- zero. Finally, the estimation can be calculated as follows in imate Counting [86], Count-Min-Log Sketch [87], SAC [85] Equ. 3 and Equ. 4: and SA [84] have led this idea to network measurement.  Traditional sketches use binary-based counters to record stage [0] =0 n−s Pi−1 (3) the frequency of flows. The counter is increased by 1 once in- stage [i] =2 × j=0 γ [j] , i > 0 coming a corresponding packet. CML Sketch [87] replaces the classical binary-based counters by log-based counting cells. estimate=c × γ [s ] + stage [s ] (4) The log-based counters increase with a predefined probability 0 0 0 x−c, where x is the log base and c is the current estimation. 
To address the inflexibility of the fixed-length sign part, Consequently, the log-based counter can use fewer bits and Dynamic SA (DSA) augments a split bit and dynamically increase the number of counters for the same storage space adjusts the sign bits, as shown in Fig. 8(c). The length of to improve the accuracy. However, CML Sketch sacrifices the sign part is initialized to 0. Except for the split digit, all the ability to support deletions for a more extensive counting other bits are used to count. With the value represented by range. the counter becomes larger, SA moves the place of split bit Small active counters (SAC) [85] divides a cell into two and increases the bits of sign part. The length of the sign part part: a k-bit estimation part A and a l-bit exponent part is increased to i, the counting part is shortened by i bit. The mode, as shown in Fig. 8(a). Like floating-point number stages change to: representation, the estimation of SAC is nˆ=A·2r·mode, and r is a global variable for all counters. An incoming packet with size c will cause the counter A update as follows.  stage [0] =0 If c≥2r·mode, SAC first increases A by b c c and then Pi−1 n−j−1 (5) 2r·mode stage [i] = j=0 γ [j] × 2 , i > 0 increases A by 1 with probability given by normalized value of the residue. If c<2r·mode, SAC increases A by 1 with In this way, DSA can accurately record mouse flows while c probability 2r·mode . When A overflows, the mode part will be being able to deal with elephant flows. increased. However, if mode overflows, renormalization step The above algorithms all focus on a single counter opti- will move the counting range to a higher scale, which is only mization for a more extensive counting range. CML Sketch supported by increasing the global variable r and renormalize leverages a fixed log base and updates counters according to all counters at once to avoid estimation collision. This up- the current count. 
In comparison, SAC and DSA take more scaling strategy is inflexible. ActiveCM [88] uses a variant of elastic strategies to enlarge the count range of little bits. All compressed counters in [85] and enlarges a 32-bit counter the these three estimators can represent large values with small maximum counting range, which is far greater than 232. symbols at the price of a minor error because they all increase Compared with SAC [85], Self-Adaptive counters (SA) [84] the hashed counters with probabilities. 2) Virtual register for spread estimation: High level counter In this subsection, we focus on spread estimation, which aims to record the distinct connections with hosts and takes Lower level counter cardinality estimation register to solve DDos, port scan, and super spreader detection. More works about cardinality es- Fig. 10: Hierarchical counter-sharing timation tasks are detailed in Section VI-A4. And the spread estimation can be divided into register sharing and bits sharing, as illustrated in Fig. 9. Counter Braids [24] compress counters into multi-layers For register-sharing. Virtual HyperLogLog Counter (VHC) by “braiding” a hierarchy of counters with random graphs. [91] shares HLL registers in a shared pool to measure flow Braiding results in drastic space reduction by sharing counters size. VHC contains an online encoding module to record among flows. A random mapping strategy is implemented to the packets in real-time, while the offline estimation module construct the relation between flows and the first-layer counters measures the size of all flows based on the counters recorded or two consecutive layers of counters. Multiple flows share by the online module. Furthermore, Virtual Register Sharing the common bits by sharing the high-layer counters to act as [89] dynamically creates an estimator for each flow by ran- the overflow counters. 
The low-layer and high-layer counters domly selecting several registers from a multi-bits registers function as an biger counters to record the size of flows. pool to estimate the cardinality of elephant flows. The register However, before querying a specific flow, the information of is the basic cardinality estimation unit, such as PCSA [73], all flows must be obtained in advance, and the decoding should LogLog [74] and HyperLogLog [75]. By sharing registers be executed offline in batching way. As a result, the query among many flows, as shown in Fig. 9(a), the space of register speed is significantly slowed down. is fully utilized. The array of registers are physical, while the Counter tree [93] is a two-dimensional counter sharing estimators are logical and created on the fly without additional scheme, including horizontal counter sharing and vertical memory allocation. counter sharing. Physical counters are logically organized in a tree structure with multiple layers to compose a new big For bits-sharing. Compact spread estimator (CSE) [90] virtual counter. Different child counters may share the same creates a virtual bit vector for each source by taking bits from a parent counters in vertical counter sharing. Virtual counters are shared pool of available bits. It counts the number of distinct shared among different flows in horizontal counter sharing. DstIP for each SrcIP and can be abstracted as multiple-set The virtual counter array for a particular flow consists of version of linear counting [92]. Every individual SrcIP has its multiple counters pseudo-randomly chosen from the virtual own virtual bits vector for linear counting, as illustrated in counters. However, as pointed in [91], the experiments based Fig. 9(b), a portion of which is randomly shared with another on real network trace data show that Counter tree cannot work SrcIP’s virtual vector. well under a tight memory space. 
Both register sharing and bit sharing construct virtual estimators over a common memory space. With a compact memory space, they can estimate cardinalities over a wide range with reasonable accuracy guarantees. These space-efficient strategies enable on-chip implementation of spread estimation for network measurement at line speed.

3) Hierarchical counter-sharing for skewed traffic:
In conventional sketches, all counters are allocated the same size, tailored to accommodate the maximum flow size. However, elephant flows occupy only a small portion of the traffic, so the high-order bits in most counters of conventional sketches are wasted. This waste leads to space inefficiency. To solve this problem, many works, such as Counter Braids [24], Counter Tree [93], Pyramid Sketch [94], the One memory access sketch (OM sketch) [95] and Diamond sketch [96], have designed hierarchical counter-sharing schemes to record flow sizes, as shown in Fig. 10. They organize the counters as a hierarchical structure in which the higher layers possess less memory. The lower-layer counters mainly record the information of mouse flows, while the higher layers record the number of overflows at the lower layers (the significant bits of the elephant flow sizes). Moreover, higher-layer counters can be shared by multiple flows to reduce space overhead.

Pyramid Sketch [94] further devises a framework that not only prevents counters from overflowing, without the need to know the frequency of the hottest item in advance, but also achieves high accuracy, update speed, and query speed simultaneously. It consists of multiple counter layers, where each layer has half as many counters as the layer below. The first layer only has pure counters to record multiplicity, while the other layers are composed of hybrid counters. Different from the Bloom sketch [97], Pyramid Sketch uses two flag bits in each hybrid counter, one per child counter, to record whether that child counter has overflowed. Unfortunately, as pointed out in SA [84], for an elephant flow the hierarchical counter-sharing strategy needs to access all layers, requiring many memory accesses for each insertion and query.

The OM sketch [95] has multiple layers with a decreasing number of counters. When an overflow occurs for the first time, the flag in the low-layer counter is set. To alleviate the speed decline in the worst case, the OM sketch takes only a two-layer counter array to record multiplicity information. Moreover, the OM sketch leverages a word constraint to confine the corresponding hashed counters within one or several machine words. To improve its accuracy, the OM sketch records the fingerprints of the overflowed flows in their corresponding machine words at the lower layer, to distinguish them from non-overflowed flows during queries.

Cold Filter [64] also consists of two counter layers. Bloom sketch [97] and Pyramid Sketch [94] take a flag bit to record whether the low layer has overflowed. By contrast, Cold Filter does not change the low-layer counters when a lower counter overflows and takes no flag bit; for hot items, the size is the sum of the counters at the two layers. Furthermore, Cold Filter takes a one-memory-access strategy to handle the memory-access bottleneck; the details can be found in Section IV-B5.

We also note that the Counting Quotient Filter (CQF) [102] augments the quotient filter [103] with variable-sized counters, just like TinyTable [101] does. To make CQF work, an "escape sequence" is proposed to determine whether a slot holds a remainder or part of a counter, and how many slots are used by that counter.
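The overflow-carry idea shared by these hierarchical designs can be sketched in a few lines of Python. This is not the exact Pyramid or OM sketch layout; it is a minimal two-layer illustration in which small low-layer counters hold the low-order bits, an overflow flag is set on the first carry, and the high-order bits land in a smaller, shared high layer. Sizes and the hash function are illustrative.

```python
import hashlib

class TwoLayerCounter:
    """Toy hierarchical counter-sharing: 4-bit low-layer counters carry
    into a half-sized layer of shared 16-bit high counters."""
    LOW_MAX = 15                              # capacity of a 4-bit counter

    def __init__(self, low_size=1024):
        self.low = [0] * low_size
        self.flag = [False] * low_size        # set once a counter overflows
        self.high = [0] * (low_size // 2)     # each high counter is shared

    def _idx(self, key):
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % len(self.low)

    def add(self, key):
        i = self._idx(key)
        if self.low[i] < self.LOW_MAX:
            self.low[i] += 1
        else:                                 # carry into the shared high layer
            self.low[i] = 0
            self.flag[i] = True
            self.high[i // 2] += 1

    def query(self, key):
        i = self._idx(key)
        if not self.flag[i]:
            return self.low[i]                # mouse flow: low layer suffices
        # significant bits of an elephant live in the shared high counter
        return self.high[i // 2] * (self.LOW_MAX + 1) + self.low[i]
```

As in the schemes above, two flows sharing a high counter can inflate each other's estimates, which is the price paid for the smaller high layer.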
Bits borrowing or combination. Adjacent Borrow and Carry (ABC) [104] takes a novel bit reorganization strategy designed for non-uniform multisets. Experiments on real-world datasets show that when an elephant flow is recorded, the neighboring counters are often empty or occupied by mouse flows. As a result, the unused bits can be borrowed by the adjacent elephant flows to enlarge their count ranges. By borrowing bits from the adjacent counter, or by combining the mapped counter with its adjacent counter into one big counter, ABC can improve memory efficiency at the cost of three bits to mark combined counters. However, the encoding of ABC is cumbersome and slows the sketch down significantly. Self-Adjusting Lean Streaming Analytics (SALSA) [105] dynamically re-sizes counters by merging neighbors to represent larger numbers. Compared with ABC, SALSA needs only one bit to mark combined counters and encodes more efficiently.

4) Variable-width counters for space efficiency:
Unlike the fixed-width counter designs, many efforts have designed variable-width counters to improve space efficiency and estimation accuracy under skewed traffic.

Counter resizability. Spectral bloom filters (SBF) [98] extend the bit vector of the BF [45] into a counter vector. The counters, packed one after another, are located by a hierarchical indexing structure for fast memory access. However, an update that causes the width of counter i to grow will cause a shift of counters i+1, i+2, ..., which can have a global cascading effect, making it prohibitively expensive when there can be millions of counters [99]. Although the expected amortized cost per update remains constant and the global cascading effect of counter shifts is small in the average case, the worst case cannot be tightly bounded. Therefore, SBF with variable-width encoding is not an effective counter solution, as it cannot ensure quick per-packet updates when a packet arrives. Bucketized Rank Indexed Counters (BRICK) [99] adopt a sophisticated variable-width encoding of counters, which uses fixed-size buckets and the Rank Indexing Technique [100] to support variable-length counters with a restricted counter sum. Specifically, BRICK bundles a fixed number of counters randomly selected from the array into buckets, and allocates only enough bits to each counter, i.e., if its current value is c, BRICK allocates ⌈lg(c)+1⌉ bits to it. In each bucket, BRICK reorganizes counters into sub-counter arrays with a decreasing number of counters. Moreover, multiple counters are composed into a new variable-width counter. BRICK can hold more counters since the average counter is shorter than the largest one [57]. Unfortunately, as the average counter value increases, the encoding becomes less efficient.

Bucket expandability. Inspired by the Rank Indexing Technique [100], TinyTable [101] reorganizes space into chained fingerprints and counters. Using the same block for both fingerprints and counters, TinyTable chains associated counters and fingerprints, where each chain always starts with a fingerprint and counters are always associated with the fingerprint to their left if necessary. In order to handle bucket overflows, TinyTable dynamically modifies the bucket size according to the actual load of that specific bucket. If there is not enough space in a bucket, TinyTable borrows space from the following buckets through the Bucket Expand operation, which takes its inspiration from linear-probing hash tables. This strategy may cause the next bucket to overflow again; in that case, TinyTable repeats the process. However, the complexity of additions and removals increases with the table load. Compared with SBF [98], TinyTable requires only a single hash function and accesses memory in a serial manner during add/remove operations.

The variable-width counter provides more flexible and elastic space efficiency for sketch design. At the overhead of complicated operations, bit-shifting expansion [98], the rank indexing technique [99] [101], bit borrowing/combination [104] and fingerprint-counter chains [101] [102] all provide variable-sized counters, which enables the counters to use substantially less space for Zipfian and other skewed flow distributions. Such variable-width counters make a proper tradeoff between space efficiency and encoding operation overhead.

5) Optimization for less memory access:
While sketches can provide an estimation from multiple shared counters, they cannot achieve both high accuracy and line speed at the same time. Accessing multiple counters requires multiple memory accesses and hash computations, and frequent memory access can become the bottleneck of sketch-based measurement. Several techniques have been developed to reduce memory accesses and hash computations when updating or querying sketches.

The OM sketch [95] leverages the word acceleration strategy to constrain the corresponding counters within one or several machine words, and achieves nearly one memory access for each insertion. Specifically, d_l+1 hash functions are associated with the lower layer L_l: the first hash function locates one machine word at the low layer, and the other d_l hash functions locate the d_l counters in this machine word. For the higher layer L_h, d_h+2 hash functions are associated with it: the first two locate two machine words at the high layer, and the other d_h hash functions locate the d_h counters in these machine words. Cold Filter [64] also proposes a one-memory-access strategy to access its low layer L_1. Specifically, it confines the d_1 counters within a machine word of W bits to reduce memory accesses. Cold Filter further splits the value produced by a hash function into multiple segments, and each segment is used to locate a machine word or a counter. In this way, the OM sketch and Cold Filter can locate multiple counters within one machine word.
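The segment-splitting trick can be illustrated with a short Python sketch: one hash value is cut into a word index plus several in-word offsets, so all d counters of a flow sit in one machine word and a single memory access suffices. The word and counter sizes below are illustrative, and the odd-step offset derivation (which guarantees distinct offsets) is a simplification rather than the exact Cold Filter or OM sketch construction.

```python
import hashlib

NUM_WORDS = 512            # number of machine words in the counter array
COUNTERS_PER_WORD = 16     # a 64-bit word viewed as 16 x 4-bit counters
D = 3                      # counters touched per insertion
words = [[0] * COUNTERS_PER_WORD for _ in range(NUM_WORDS)]

def locate(key):
    """Split one hash value into segments: a word index plus D distinct
    in-word offsets, so every update touches a single machine word."""
    h = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
    word = h % NUM_WORDS
    h //= NUM_WORDS
    start = h % COUNTERS_PER_WORD
    h //= COUNTERS_PER_WORD
    step = (h % (COUNTERS_PER_WORD // 2)) * 2 + 1  # odd step -> distinct offsets
    offsets = [(start + k * step) % COUNTERS_PER_WORD for k in range(D)]
    return word, offsets

def insert(key):
    word, offsets = locate(key)        # one word -> one memory access
    for off in offsets:
        words[word][off] = min(words[word][off] + 1, 15)   # saturating 4-bit

def query(key):
    word, offsets = locate(key)
    return min(words[word][off] for off in offsets)        # CM-style minimum
```

Confining all of a flow's counters to one word trades a little independence between hash positions for a large reduction in memory accesses, which is exactly the tradeoff these schemes accept.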
To reduce the time consumption of memory access, the Recyclable Counter with Confinement (RCC) [106] also limits a virtual vector to one memory block, so that reading or writing a virtual vector requires only one memory-block access. To maintain only one memory access per bucket operation, Cuckoo counter [107] configures each bucket as 64 bits. Furthermore, to efficiently handle skewed data streams, Cuckoo counter implements entries with unequal lengths to insulate mouse flows from elephant flows.

By confining the counters of a specific flow within one or several words, sketches can limit memory accesses and speed up insertions and queries. The above sketches achieve lower memory-access overhead at the cost of lower accuracy.

6) Counter update strategies for higher accuracy:
Novel counter updating strategies have been proposed to improve the estimation accuracy of sketches. In this subsection, we review such strategies by classifying them into CM-Sketch variants, top-k detection, and various-hash designs.

CM-Sketch variants. Conservative update [108] increases the counters as little as possible. The intuition is that, since a point query returns the minimum of all d values, CM should update a counter only when it is necessary. This design avoids unnecessary increases of counter values to lessen the over-estimation problem. The CM-CU sketch also applies the conservative update method. Specifically, CM-CU updates a flow f with size c as sk[k, hk(f)] = max{sk[k, hk(f)], bc(f) + c}, where sk[k, hk(f)] represents the value of the counter before updating, and bc(f) is the query result of flow f before updating, i.e., bc(f) = min{sk[k, hk(f)] : 1 ≤ k ≤ d}. SBF [98] also optimizes with a "minimum increase" scheme, which prefers conservative insertion, i.e., it only increases the smallest counter(s). Count sketch (CSketch) [83] introduces one more hash function gk() for the kth array to map flows onto {+1, −1}. When updating a flow f with size c, the corresponding counter is increased by c · gk(f). CSketch returns the median over the d counters, i.e., median{sk[k, hk(f)] · gk(f) : 1 ≤ k ≤ d}, as an unbiased estimate of the point query, where sk[k, hk(f)] represents the count of flow f in the kth array. The Counter Sum estimation Method (CSM sketch) [109] splits hot items into small pieces and stores them in small counters. Flows are hashed into l counters, and the flow size is divided into l roughly equal shares, each of which is stored in one counter. The CSM sketch randomly increments one of the hashed counters during insertions, and during queries reports the sum of all the hashed counters minus the estimated noise. Reviriego et al. [110] first propose protection techniques to minimize soft errors that flip the contents of memory cells. They evaluate the effect of soft errors on the CM Sketch and propose to leverage upper-counter-bit encoding and a single parity bit per counter to detect errors.

Top-k detection. The Randomized admission policy (RAP) [111] and Frequent [112] update a fixed key-value table by a randomized method for top-k identification and frequency estimation. When the table is full, RAP kicks out the item with the minimum counter value c_m with probability 1/(c_m + 1); otherwise, the incoming item is discarded. Frequent decreases all counters and evicts the zero-count flows to make space for new elements, while Space Saving [113] always directly evicts the item with the minimal count. Unbiased space saving [114] has more robust frequent-item estimation properties than Sample and Hold, and gives unbiased estimates for any subset-sum. CountMax [115] and MV-sketch [116] apply the majority vote algorithm (MJRTY) [117] to track the candidate heavy flow in each bucket. When the incoming packet matches the recorded one, the corresponding counter is increased; otherwise, it is decreased. Furthermore, the heavy candidate item is replaced by the new item if the counter turns negative. HeavyKeeper [118] and HeavyGuardian [119] leverage a probabilistic method called "exponential-weakening decay": when the incoming item does not match the stored items, the recorded flow size decays with a probability. As a result, the elephants are stored in the buckets, and the mice are decayed and replaced with high probability. The core idea of these methods is to treat each element as a hot item and insert it into the top-k data structure. After the data structure's space runs out, cold elements are gradually kicked out with a certain probability. However, as most elements are cold, processing so many cold items not only incurs extra overhead but also causes significant inaccuracy in the ranking and frequency estimation of top-k items [120].

Various-hash designs. The Weighted Bloom filter [121] allocates ke hash functions to each element e depending on its query frequency and its likelihood of being a member. This strategy can also be gracefully integrated into frequency estimators to maintain a set of elephant elements. Frequency-aware Counting (FCM) [122] addresses the inaccuracy for low-frequency flows. FCM uses a varying number of hash functions per item by dynamically identifying low- and high-frequency stream items. The low- and high-frequency items are classified by an MG counter [123], which keeps track of the counts of unique items. The strategy is realized by setting two additional hash functions that determine the initial offset in the rows and the gaps between two adjacent rows. To make high-frequency items update fewer counters than low-frequency items, the gap for low-frequency items is smaller than that for high-frequency items. Thus the accuracy for low-frequency items is increased.

Besides, there are various other specific counter updating strategies, such as those related to the aforementioned Sections IV-B1, IV-B2, IV-B3 and IV-B4, sampling in Section III-B1, and randomization in Section III-B2. The details can be found in the corresponding subsections.
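The conservative update rule described for CM-CU can be stated in a few lines of Python. This is a minimal sketch, with illustrative depth, width, and hash choices: on insertion, each counter is raised only up to max(counter, estimate + c), so no counter is pushed above what the point query needs.

```python
import hashlib

class CMSketchCU:
    """Count-Min sketch with conservative update: on insert, raise only
    the counters that would otherwise fall below the new lower bound."""
    def __init__(self, depth=4, width=1024):
        self.rows = [[0] * width for _ in range(depth)]
        self.width = width

    def _cells(self, key):
        for k in range(len(self.rows)):
            h = hashlib.md5(f"{k}|{key}".encode()).hexdigest()
            yield k, int(h, 16) % self.width

    def insert(self, key, c=1):
        cells = list(self._cells(key))
        new_est = min(self.rows[k][j] for k, j in cells) + c
        for k, j in cells:
            # conservative update: sk[k, hk(f)] = max(sk[k, hk(f)], bc(f) + c)
            self.rows[k][j] = max(self.rows[k][j], new_est)

    def query(self, key):
        return min(self.rows[k][j] for k, j in self._cells(key))
```

Compared with a plain Count-Min sketch, the only change is in `insert`: counters already above the new estimate are left untouched, which measurably lessens over-estimation under skewed traffic.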
7) Slot enforcement for various functionalities:
As shown in Fig. 11, because of the various measurement requirements, sketches have enforced their buckets with other cells to support different additional functions. By augmenting a timestamp, a packet-header field counter, or a key-XOR cell, a sketch can support latency detection, heavy-hitter tracking, and key-enumeration functionality.

Fig. 11: Slot enforcement strategy (LDA: timestamp and packet counter to track fine-grained latencies; LD-Sketch: key-value array, array length and estimation error for heavy key detection; FlowRadar: FlowXOR, FlowCount and PacketCount for network-wide flow decoding; HeavyGuardian: elephant guardian for separating and guarding elephants; MV-Sketch: invertible sketch for heavy flow detection)

Latency detection. The Lossy Difference Aggregator (LDA) [124] is designed to measure packet loss and delay over short time scales. LDA maintains timestamp-accumulator and packet-counter pairs in an array. Packets are divided into groups according to a hash function. For each group, the sum of timestamps and the total count of packets are sent to the receiver. Upon receiving this information, the receiver first applies the same hash function to divide packets into groups. If the total packet count of a group at the receiver matches that at the sender, the receiver can calculate the average delay for the group; otherwise, the receiver discards the group. Therefore, LDA significantly reduces measurement overhead by transmitting only the sums of timestamps and the packet counters. However, when packets are lost or reordered within a group, the entire group must be discarded, so the delay of such a group cannot be estimated.

Heavy hitter tracking. LD-Sketch [125] augments each bucket S_{i,j} with one additional component, A_{i,j}, an associative array to keep track of the hashed heavy-key candidates, and three parameters: V_{i,j}, which is incremented by the size of each incoming flow; l_{i,j}, the maximum length of the array; and e_{i,j}, the maximum estimation error for the true sums of the keys hashed to the bucket. To keep track of only heavy-flow candidates, LD-Sketch needs to create or delete key-value pairs in the associative array, with l_{i,j} = ⌊(k+1)(k+2)−1⌋, in which k = ⌊V_{i,j}/T⌋. However, the detection accuracy declines dramatically when the frequency of heavy flows is high [126]. At the beginning of detection, LD-Sketch is prone to frequent jitter of its arrays. Moreover, the detection accuracy and memory usage are sensitive to the order of the incoming keys. Later work [125] proposes three enhancement heuristics to address this problem. RL-Sketch [126] further uses Reinforcement Learning to evict trivial flows while keeping heavy-key candidates.

MV-sketch [116] augments each bucket with three components: V_{i,j}, the total sum of the values of all flows hashed to the bucket; K_{i,j}, the current candidate heavy-flow key in the bucket; and C_{i,j}, which indicates whether the candidate heavy flow should be kept or replaced. Using a sufficient number of buckets, MV-sketch can significantly reduce the probability that two heavy flows are hashed into the same bucket, and hence accurately track multiple heavy flows. HeavyGuardian [119] enlarges each bucket to store multiple key-value pairs as a heavy part, together with several small counters as a light part. The heavy part precisely stores the frequencies of hot items, and the light part stores the frequencies of cold items approximately. The heavy part takes an Exponential Decay strategy to decrement the count of the weakest items with a probability. In this way, HeavyGuardian can intelligently separate and guard the information of hot items while approximately recording the frequencies of cold items.

Key-reversible functionality. FlowRadar [127] extends the Invertible Bloom Filter (IBF) [128] design with three fields: FlowXOR, PacketCount, and FlowCount. In each cell it records the number of flows and the number of packets that have been hashed into the cell, and XORs the corresponding flow identifiers together. When a packet arrives, FlowRadar first checks the flow filter to see whether the flow has already been stored in the flow set. If the packet belongs to a new flow, all three fields are updated; otherwise, only PacketCount is increased. When the collector receives the encoded flows, it takes a decoding strategy to derive the flow IDs reversely. Specifically, it decodes the per-flow counters by looking for cells that contain a single flow (called pure cells). After locating the cells containing such a flow, FlowRadar removes the flow from them. This process ends when there are no pure cells left.

By augmenting the counters with other fields, sketches can support various functionalities. In the future, as networks develop, newly emerging functionalities will require yet other fields in a cell/bucket.

C. Sketch level optimization

After summarizing the counter-level optimizations, the base of a sketch, we now get into the details of integral-sketch design. Sketch-level optimization can be divided into multi-layer composition, multi-sketch composition, and sliding-window design. Researchers have developed novel algorithms that skillfully enlarge the functions of sketches by composing multiple basic units or different sketches together.

1) Multi-layer composition:
Many works have proposed multi-layer designs based on one basic unit to enable different measurement tasks. We list the related work in Table III, which contains the basic unit and the enabled function of each design.

For entropy estimation. Defeat [129] leverages the entropy of the empirical distribution of traffic features to detect unusual traffic patterns. It treats abnormalities as unusual distributions of these features. Defeat generates separate sketches of SrcIP, SrcPort, DstIP and DstPort. Moreover, multiple histogram-sketches from different switches are summed to form a global sketch, which can be utilized to detect, identify, and classify attacks. However, the complex construction of empirical histograms makes it less appealing for online realization [138]. The Intersection Measurable Property (IMP Sketch) [132] observes that the entropy of an OD flow can be estimated by a function of just two different Lp norms (Lp1, Lp2, p1 ≠ p2) of OD flows. IMP can deduce the entropy of the intersection of two streams A and B from the sketches of A and B.

TABLE III: Multi-layer design

Methods | Basic unit | Enabled function
Defeat [129] | Histogram-sketch | Apply the subspace method [130] to detect anomalies
MRSCBF [131] | SCBF [131] | Improve the estimation accuracy with different resolutions
IMP [132] | Lp Sketch [133] | Deduce the entropy of the intersection of two traffic streams
Sequential Hashing [134] | Hash table | Enable the key-reversible function
RCD sketch [135] | Bit array | Reverse the hosts with large connection degrees
SF-sketch [136] | CM Sketch, Count Sketch | Maintain a small sketch for transmission
HashPipe [137] | Match-action hash table | Use P4 switches for HH detection
SketchLearn [22] | Bit-level sketch | Leverage multiple bit-level Gaussian distributions
Diamond sketch [96] | Atom sketches | Perform per-flow measurement on skewed traffic

Specifically, every participating ingress and egress node in the network maintains a sketch of the traffic passing by it during each measurement interval. When measuring the entropy of an OD flow between an ingress point i and an egress point j, IMP collects the sketches of the ingress stream O_i and the egress stream D_j. It leverages the L_p sketch [133] to derive the L_p norm of the traffic stream based on the property of the median, and estimates the entropy of OD flows accordingly.

For more accuracy. Diamond Sketch [96] composes atom sketches (counter arrays) into three parts: an increment part, a carry part, and a deletion part. The increment part records the size of a flow, the carry part records the overflow depth of each flow, and the deletion part is used to support deletions. Diamond Sketch assigns an appropriate number of atom sketches to elephants and mice dynamically. This strategy improves the accuracy considerably while keeping a comparable speed. The Slim-Fat sketch (SF sketch) [136] takes a small-large sketch composition to relax the bandwidth requirement when a local monitor transfers information to the remote controller. The large sketch is used to assist the small sketch during insertions and deletions to improve accuracy within a limited space, while the small sketch is transferred and used to answer queries. The Multi-Resolution Space-Code Bloom Filter (MRSCBF) [131] employs multiple SCBFs operating at different sampling resolutions. An SCBF can approximately represent traffic flows as a multiset. With different sampling probabilities, elements with low multiplicities are estimated by the SCBFs of higher resolution, while high multiplicities are estimated by the lower-resolution SCBFs. However, MRSCBF is very slow in terms of memory accesses and hash computations, and its bitmap nature determines that it is not memory-efficient for counting [93].

For reversible design. SketchLearn [22] maintains l+1 levels of sketches, where each level corresponds to a sketch formed by a counter matrix with r rows and c columns. The level-0 sketch records the statistics of all packets, while the level-k sketch, for 1 ≤ k ≤ l, records the statistics of the kth bit of the flowkey. V_{i,j}[k] represents the counter value of the level-k sketch at the ith row and jth column. SketchLearn focuses on R_{i,j}[k] = V_{i,j}[k] / V_{i,j}[0], which denotes the ratio of the counter value V_{i,j}[k] to the overall frequency V_{i,j}[0] of all flows hashed to stack (i, j). By leveraging the Gaussian distribution assumption, SketchLearn separates the large and small flows. Specifically, it uses Bayes' theorem to estimate the bit-level probabilities p̂[k], and further obtains a template flowkey composed of zero, one and ∗. By estimating the frequency of each k-bit using maximum likelihood estimation, SketchLearn gets frequency estimates of the candidate flows.

The Reversible connection degree sketch (RCD Sketch) [135] is a new data streaming method for locating hosts with a large connection degree. Based on remainder characteristics from number theory, the in-degree/out-degree of a given host can be accurately estimated. A connection degree sketch is denoted as B = (B_1, ..., B_H), where B_i (1 ≤ i ≤ H) is a v × m_i bit array and H is the number of bit arrays. B_i[j][k] (0 ≤ j < v, 0 ≤ k < m_i) is associated with a hash function h_i : {0, 1, ..., n−1} → {0, 1, ..., m_i−1}, where n is the size of the source space of packets. All B_i[j][k] (0 ≤ j < v, 0 ≤ k < m_i) share a hash function f_i : {0, 1, ..., E−1} → {0, 1, ..., v−1}, where E is the size of the combined source and destination space. The bit arrays are all set to zero at the beginning of a measurement epoch. For an incoming packet p_i = (s_i, d_i), the sketch is updated as B_k[f_k(s_i||d_i)][h_k(s_i)] = 1 for all 1 ≤ k ≤ H. Thus, for any source s, the packets with source s are all hashed to the columns B_i(s) = B_i[·][h_i(s)] (1 ≤ i ≤ H) of the sketch, from which we can obtain the sum of the out-degrees of source s. Moreover, the host addresses associated with large in-degrees/out-degrees can be reconstructed by simple equations based on the Chinese Remainder Theorem, without using address information.

Sequential Hashing [134] constructs reversible multi-level hash arrays for fast and accurate detection of heavy items. The idea of Sequential Hashing is that combining the heavy sub-bits of a key can efficiently discover the full keys of heavy items. The original problem is divided into nested sub-problems: prefixes of different lengths of the flow key are extracted as sub-keys, each sub-key space is enumerated, and the recovered sub-keys are combined to form the heavy flows. However, the update costs of both the Reversible Sketch [139] and Sequential Hashing [134] increase with the key length. Both of them utilize position information to recover the keys in their sketches, and the running time to recover keys cannot achieve a sub-linear bound [69].

For universal sketching. UnivMon [140] maintains log(n) parallel copies of an "L2-heavy hitter" (L2-HH) sketch instance, where n is the number of unique elements in the traffic stream. UnivMon creates substreams of decreasing lengths, as the jth instance is expected to have all of its hash functions agree to sample half as often as the (j−1)th sketch. Each L2-HH instance outputs its L2 heavy hitters Q (Q_0, ..., Q_log(n)) and their estimated counts.

TABLE IV: Multi-sketch design

Methods | Combined methods | Enabled function
Count-Min Heap [27] | Count-min Sketch + heap | Find top-k elements
DCF [141] | SBF [98] + CBF [66] | Represent the multisets compactly
FSS [142] | Bitmap filter + Space Saving [113] | Minimize updates on the top-k list
PMC [143] | FM sketch [73] + HitCounting [92] | High-speed per-flow measurement
ASketch [63] | Filter + Sketch (CM [83], Count sketch [27]) | Improve accuracy and overall throughput
Bloom sketch [97] | BF [45] + multi-levels (Section IV-B3) | Memory-efficient counting

L2 heavy hitters Q (Q0, ··· , Qlog(n)) and their estimated is full. The heap is kept small by checking that the estimated counts. By leveraging the results from the L2-HH instance, count for the item with the lowest count is above the threshold. Univmon can use the Recursive Sum Algorithm from the If not, the item is deleted from the heap. At the end of each theory of universal sketching to get an unbiased estimator of epoch, all the heavy hitters can be output by scanning the G-sum functions of interest. Specifically, it recursively applies whole heap. ActiveCM+ [88] also augments min-heap to track the result from Qlog(n) to Q0 sketch to function g to form the top-k heavy hitters within each sub-sketch. the interested G-sum using these sampled streams. However, For filtering packets. Filtered Space-Saving (FSS) [142] UnivMon incurs large and variable per-packet processing over- augments Space Saving with a pre-filtering approach. A head, resulting in a significant throughput bottleneck in high- bitmap is used to filter and minimize updates on the measured rate packet streaming. ActiveCM+ [88] proposes a progressive list. When a new item is incoming, the bitmap counter is first sampling techniques to solve this problem. Specifically, it checked if there are already measured items in the bitmap; and only updates the last sketch where flow f is sampled instead the measured list is searched to see if this element is already of variably multiple sub-sketches. Moreover, the progressive there. The update is conducted according to the check result sampling also reduces the moment estimation error by more about the bitmap and measured list. However, FSS needs to than half that UnivMon. In this way, ActiveCM+ reduces search a certain flow key or to identify the minimum counter in memory footprint and improves measurement accuracy. the list. It needs to traverse the whole heap in the worst case 2) Multi-sketch composition: without additional memory support. 
ASketch [63] augments By composing different data structures, sketch-based meth- a pre-filtering stage to identify the elephants and leaves the ods can further support various functions. We also list the mouses to sketch. Based on the skewness of the underlying related work and the corresponding basic sketch in Table. IV. stream, Asketch improves the frequency estimation accuracy For memory efficiency. Dynamic count filters (DCF) [141] for the most frequent items by filtering them out earlier. For extends the concept of SBF [98] and improves the memory the items that are stored in the sketch, Asketch reduces their efficiency of SBF by using two filters. The DCF is designed for collisions with the high-frequency items, as the items stored speed and adaptiveness in a straightforward way. It captures in the filter are no longer hashed into the sketch. This reduces the best of SBF and CBF, the dynamic counters from SBF, the possibility that a low-frequency item would appear as a and the fast memory access from CBF. The first filter is high-frequency item, therefore reducing the misclassification composed of fixed size counters. The novelty is that the size of error. Bloom Sketch [97] is composed of multiple layers of counters in the second filter is dynamically adjusted to record sketch+BF composition. The low layers are responsible for the corresponding counters’ overflow times in the first filter. low-frequency items, and the high layers process the items Unfortunately, the two filters increase the complexity of DCF, whose multiplicity can not be counted by the lower sketches. which degrades its query and update speed [136]. Moreover, the BFs record whether the high layers record the For hot-cold separation. Probabilistic Multiplicity Count- items. It leverages BF to recognize items whose multiplicity ing (PMC) [143] leverages FM sketch [73] to measure hot is not successfully recorded in the low layer. 
PMC also uses a modified HitCounting [92] to estimate cold items. Specifically, the FM sketches of all estimators share their bits from a common bit pool uniformly at random, so that the mostly unused higher-order bits in the registers can be utilized. PMC can record information on a passing-by packet by setting only one single bit in the packet header field; based on the packet header, a specific hashing mechanism determines the enabled bit position. Moreover, although PMC was initially designed to estimate flow size, it can be easily modified for flow cardinality estimation, which other flow-size estimators do not support.

For Bloom Sketch [97], in this way, the low-frequency items are recorded in the low layer, while the high-frequency items are recorded in both the low and high layers. Sketchtree [144] comprises multiple filters, each associated with a specific task, to measure elephant flows efficiently and aggregate mouse flows. Specifically, Sketchtree is task-oriented, and each filter is responsible for sending the related flows to the corresponding sketch and the others to the next filter for other measurement tasks.

Count-Min Heap [27] augments the CM sketch with a heap used to track all heavy flows and their estimated frequencies. If the estimate of an incoming flow exceeds the threshold, the flow is added to the heap, or replaces the minimum flow when the heap is full.

3) Sliding window design:

Traditional sketch-based methods focus on estimating flow sizes/volumes from the beginning of the traffic stream (the landmark window model) [145]. In this model, beginning at a "landmark" time point, sketch-based methods mainly focus on the data which falls between the landmark and the current time point. With time passing, more and more packets pass through
the monitor, the sketch runs out of space, and it has to be reset periodically [146], which is the intrinsic disadvantage of this model. For many real-time applications, the most recent elements of a stream are more significant than those that arrived a long time ago. This requirement gives rise to the sliding-window sketch model. Based on the sketch-based methods, the sliding strategy removes the expired elements as new elements arrive, thereby always maintaining the most recent W elements in the data stream. A lot of effort has been invested in this area to capture the most recent information.

First-in-first-out design. This strategy inserts the new elements and evicts the stale ones. Sliding HyperLogLog [77] aims to estimate the number of distinct flows over the last w units of time, where w is smaller than the time window W. A list of arrival timestamps and the position of the leftmost 1-bit in the binary representation of the hashed value associated with each packet is maintained to record the recent packets' information within W time units.

Randomized aging design. Aging Counter Estimation (ACE) [145] maintains the most recent W elements and adapts the counter sharing idea [109] to the sliding window model. For ACE, an aging algorithm is proposed to eliminate one element whenever a new element arrives: it randomly picks a counter in the array and decreases it by one if applicable. It is simple and efficient. However, as t becomes large, the sliding window accumulates many expired elements, introducing more noise into the size estimation. Memento [151] is a family of sliding window algorithms for the HH and HHH problems in single-device and network-wide settings. Specifically, for each packet, Memento performs a full update operation (delete the stale data and add the new) with some probability, or makes a window update (delete only the outdated data).

Time adaptive updating. Ada-Sketch [152] proposes a time-adaptive scheme inspired by the well-known digital Dolby noise reduction procedure.
Specifically, when updating, Ada-Sketch applies pre-emphasis and artificially inflates the counts of more recent items compared with older items, i.e., it multiplies the updates c_i^t by f(t), a monotonically increasing function of time t. Then, when querying, Ada-Sketch applies de-emphasis to estimate the frequency of item i at time t, i.e., it divides the query results by f(t).

For Sliding HyperLogLog, when packets fall out of the window, they are evicted from the list, and the incoming item is inserted. Although maintaining the timestamps exactly is the most accurate strategy for recent information, it incurs a heavy space overhead; the randomized aging strategy may be a better tradeoff between estimation accuracy and maintenance overhead. The Sliding Window Approximate Measurement Protocol (SWAMP) [147] stores flow fingerprints in a cyclic buffer, while the frequencies are maintained in TinyTable [101]. The cyclic buffer corresponds to a measurement window W. Upon the insertion of a new element, its fingerprint replaces the oldest item in the buffer; furthermore, the corresponding count in TinyTable is decreased, while the counter of the arriving flow is increased.

Scanning aging design. Sliding sketch [153] maintains multiple counters in each cell and updates the most recent counter. Unlike the above strategies, Sliding sketch adopts a scanning operation to delete the outdated counters, using a scanning pointer over the sketch. The speed of the scanning pointer is determined by the length of the sliding window. However, the scanning pointer process strongly influences the estimation accuracy.

Partitioned blocks design.
By applying partitioned blocks to record the information, the sketch can randomly perform the aging operation in the oldest block and the update operation in the newest block. Hokusai [148] uses a series of CM sketches for different time intervals. It uses large sketches for the recent intervals and small sketches for the older intervals to adapt the error within a limited space. As time passes, a large sketch can be halved into a small sketch by adding one half of the sketch to the other half. Window Compact Space-Saving (WCSS) [149] divides the traffic stream into frames of size W. It maintains a queue for each block that overlaps with the current window. When a block no longer overlaps with the window, the related queue is removed. This strategy keeps the number of queues constant at all times. WCSS supports point queries with constant-time complexity under the sliding window model. Segment Aging Counter Estimation (S-ACE) [145] uses multiple data synopses to store the elements arriving in different window segments. This design can maintain the relative order between the window segments with their data synopses and improves the probability of deleting the correct outdated elements. HyperSight [150] also applies a partitioned Bloom filter to maintain a coarse-grained time order for testing packet behavior changes. However, this kind of strategy keeps too many sketches, which incurs an expensive memory overhead.

D. Summary and lessons learned

In this section, we detailed the structure optimizations in terms of hashing strategy, counter-level optimization, and sketch-level optimization. Until now, a lot of work has been developed to enhance sketches: with optimization techniques ranging from the flow-to-counter rules and the basic counter unit to the whole sketch, existing sketches can be upgraded using the techniques described in this section. In the future, we believe learned hash maps [154] [155] [156] will play an important role in flow-to-counter rule design because of their lower space overhead and fewer hash collision errors. Moreover, novel counter and sketch architecture designs will be guided by traffic characteristics and the development of network architectures.

V. OPTIMIZATION IN POST-PROCESSING STAGE

After recording traffic statistics into sketches, we must perform some operations before obtaining valuable information from the sketches, such as sketch compression and merging. Furthermore, some novel information extraction techniques have been proposed to enrich the sketch functionality. In this section, we survey the optimizations in the post-processing stage.

A. Techniques for sketch compression and merging

Although sketches have been designed as synopses that require only a few memory accesses for each insertion, there remains scope for improvement.
Given the compression rate z, an elastic sketch splits S into z equal divisions or sub-sketches, each of which has a size of w0 × d. By combining the counters with the same index in each division (S_i^k[j], k = 1, ··· , z) into the same group, an elastic sketch builds a compressed sketch B with

B_i[j] = OP_{k=1}^{z} S_i^k[j]  (1 ≤ i ≤ d, 1 ≤ j ≤ w0),

where OP is the compression operation (e.g., Max or Sum). The authors proved that after Sum compression, the error bound of the CM sketch does not change, whereas after Max compression, the error bound becomes tighter. Same as Elastic Sketch
[158], Cluster-reduce [159] also divides the counters into groups based on adjacency. After that, the counters in nearby groups which have similar values are rearranged into clusters, and each cluster is then reduced to a unique representative value. Cluster-reduce simultaneously achieves high efficiency of the compression procedure, support for direct queries without decompression, and high accuracy of the compressed sketches.

Compression for time-adaptive updating. Hokusai [148] maintains a series of CM sketches for different intervals to estimate the frequency of any item for a given time or interval. The sketch space for recent intervals is larger than that for older intervals, so as to make the sketch time-adaptive. As time passes, recent sketches are updated with incoming traffic, and older sketches are derived by compressing the most recent sketches one by one. Specifically, a sketch is compressed by adding one half of the sketch to the other half. Although this is a natural solution for achieving time adaptability, it still has several shortcomings, such as discontinuity, inflexibility, excessive sketches, and a large overhead for shrinking [152].

Fig. 12: Sketch merging and compression ((a) compression operation; (b) SUM/MIN merging operations).

There remains scope for improvement in terms of reducing the bandwidth bottleneck. Remote controllers require high-speed memory to receive filled sketches from multiple measurement nodes. Consequently, (1) the size of the filled sketch that each measurement node sends to the collector should be extremely small, and (2) the compressed or merged sketch should contain sufficient information that allows the remote collector to answer queries accurately. The accuracy with which a sketch can provide answers to queries quantifies how close the estimate obtained from the sketch is to the actual value of the frequency [136]. Several studies have proposed merging or compressing sketches before sending them to the collector to reduce bandwidth overhead.
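Hokusai's halving step, adding one half of each row to the other half so that an item hashed to column j is afterwards found at j mod (w/2), fits in a few lines. This is an illustrative helper under our own naming:

```python
def halve_sketch(rows):
    """Shrink a CM-style sketch to half width by folding each row:
    new_row[j] = row[j] + row[j + w/2]. A query then uses the hash value
    modulo the new, halved width."""
    w = len(rows[0])
    assert w % 2 == 0, "width must be even to halve"
    return [[row[j] + row[j + w // 2] for j in range(w // 2)] for row in rows]

# folding keeps every original counter inside some compressed counter:
# counts at columns 1 and 5 of a width-8 row both land at column 1 mod 4
row = [0, 3, 0, 0, 0, 7, 0, 0]
assert halve_sketch([row]) == [[0, 10, 0, 0]]
```

Folding only adds counters together, so the result stays a valid (over-)estimate for every item, which is why repeated halving preserves the CM error guarantee at coarser resolution.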
1) Sketch compression:

Several studies have proposed compressing sketches, as shown in Fig. 12(a): sketches are compressed to adapt to the available bandwidth before being sent, or to save space for time-adaptive updating.

Compression for cardinality estimation. Arithmetic-coding-based compression [157] uses arithmetic coding to compress FM sketches [73] and HyperLogLog [75] in a bitwise and register-wise manner, respectively. The compression scheme can be understood as a separation of the represented value from the significantly larger duplicate-insensitivity information. However, this approach was designed only for cardinality estimation sketches, and it does not meet the requirements for compressing CM sketches.

Compression for transformation. SF-sketch [136] maintains a fat sketch and a slim sketch at a single measurement point. The fat sketch is responsible for recording the traffic flow accurately.

2) Sketch merging:

For collaborative measurement, distributed sketches must be merged into a large summary, as shown in Fig. 12(b).

Merging for CM-based sketches. For merging, homogeneous sketches can be added or subtracted directly, using the same hash functions and structure for all the sketches; alternatively, isomeric sketches can be used, finding the corresponding counters to obtain the final result with different hash functions and sketch sizes. SketCh REsource Allocation for Measurement (SCREAM) [26] allocates different sketch sizes at each switch, so that each sketch may have an array of a different width and the sketches cannot be summed directly. SCREAM formulates the merged sketch in a distributed measurement by finding the corresponding counters for each flow and summing the minimum within each sketch.

Merging for top-k summary. Mergeable summaries [160] such as the MG summary [123] and Space Saving [113] are mergeable and isomorphic.
For tracking the top-k flows, two top-k records can be merged by subtracting the MG counters which are lower than the (k + 1)th counter in the merged MG summary. The procedure can be completed with a constant number of sorts and scans of summaries of size O(k).

Inspired by In-band Network Telemetry (INT), LightGuardian [161] divides a sketch into sketchlets, which can be encoded into packet headers for further aggregation, reconstruction, and analysis in the end-hosts. By this design, LightGuardian can collect per-flow per-hop information for all the flows in the network with negligible overhead.

The fat sketch of SF-sketch allows the slim sketch to improve accuracy with less space overhead. When inserting or deleting an item, SF-sketch first updates the fat sketch and then updates the slim sketch based on the estimation from the fat sketch. With this design, the traffic flow information can be compressed into a smaller sketch for transmission. An elastic sketch [158] first groups the counters in the CM sketch and then compresses the counters in the same group into a single counter. Consider a sketch S of size zw0 × d, where w = w0 × z is the width, d is the depth, and z is an integer representing the compression rate.

B. Techniques for information extraction

1) Flow key reversible techniques:

Traditional methods only record the volume information of a flow into sketches according to the hash result of the flow key, and provide a fast scheme for insertion and querying of traffic flows. However, no scheme can efficiently recover the flow key from the sketch itself.

Sub-key reversibility. Reversible Sketch [139] finds heavy flows by pruning the enumeration space of flow keys. It divides a flow key into smaller sub-keys that are hashed independently, and concatenates the hash results to identify the hashed buckets.
To recover heavy flows, it enumerates each sub-key space and combines the recovered sub-keys to form the heavy flows. Sequential Hashing [134] follows a design similar to that of Reversible Sketch [139]; however, it hashes key prefixes of different lengths into multiple smaller sketches. The update costs of both Reversible Sketch and Sequential Hashing increase with the key length.

Recovering the flow key otherwise requires enumerating the flow key space; both direct recording of the flow keys and enumeration are inefficient for traffic measurement because of space restrictions and real-time response requirements. Many studies have proposed solutions to this problem using bit count, sub-key count, IBF-based, or other methods.

Direct key recording. LD-sketch [125] maintains a two-dimensional array of buckets and associates each bucket with an associative array to track the candidate heavy flows that are hashed to the bucket. However, updating the associative array involves a high memory access overhead, which increases with the number of heavy flows. In particular, LD-sketch occasionally expands the associative array to hold more candidate heavy flows; nevertheless, dynamic memory allocation involves high costs and is challenging to implement in hardware. MV-sketch [116] also equips each bucket with a flow key field to record the candidate heavy flow and an indicator counter to implement the majority vote algorithm (MJRTY). Both LD-sketch and MV-sketch detect heavy keys by directly recording the candidate flow keys. Nevertheless, direct recording is not a desirable strategy for per-flow measurement.

Bit reversibility. BitCount [162] maintains a counter for each bit in the IP address to count the total number of 1-bits. It can recover the heavy hitter keys by combining the bits whose counters are larger than a given threshold. Meanwhile, Deltoid [68] comprises multiple counter groups with 1+L counters each (where L is the number of bits in a key), in which one counter tracks the total sum of the group and the remaining L counters correspond to the bit positions of a key. It maps each flow key to a subset of groups and increases the counters whose corresponding key bits are 1. To recover heavy flows, Deltoid first identifies all groups whose total sums exceed the threshold.

IBF variants. FlowRadar [127] takes an extension of the IBF [128] as the encoded flowset. Each counting table consists of three fields, namely FlowXOR, FlowCount, and PacketCount, which record the XOR of all flows, the number of flows, and the number of packets of all flows mapped into the bucket, respectively. By finding a bucket that includes just one flow (called a pure cell), FlowRadar can decode that flow by leveraging the hash functions to locate the other cells of this flow and removing it from all those cells (by XORing with the FlowXOR fields, subtracting the PacketCount, and decrementing the FlowCount). The decoding process continues to find new pure cells until no pure cell remains.

Deltoid [68], Reversible Sketch [139], Sequential Hashing [134], Fast Sketch [69], SketchLearn [22], and BitCount [162] are all bit-level or sub-key (multi-bit)-level sketches designed for reversibility, to deduce flow keys directly. They first iteratively check the collected sketch for bits or sub-keys to identify the candidate flow sub-keys for the measurement tasks. The candidate flow key is then composed of these sub-keys and further examined against the information recorded by the entire sketch to form the final measurement result.

2) Probability-theory-based techniques:

In contrast to the traditional counter estimation techniques, some methods abstract counters as random variables and leverage probability-theory-based techniques, such as maximum likelihood estimation and Bayesian estimation, to obtain the estimated counter of the flow.
By formulating the probability models of the related variables, sketch methods can deduce the desired statistical properties of the flow traffic.

If each such group has only one heavy flow, the heavy flow can be recovered: a bit is 1 if its counter exceeds the threshold and 0 otherwise. Fast Sketch [69] is similar to Deltoid [68], except that it maps the quotient of a flow key to the sketch. An incoming flow is first added to the first counter in the hashed rows; moreover, the quotient function determines the counters in each row to which the flow size is added. After checking the rows one by one and recovering the quotient bits that contribute to the heavy change, Fast Sketch can identify the heavy changes using the inverse function according to the quotient technique. SketchLearn [22] combines the sketch design of Deltoid [68] with automated statistical inference. Specifically, it maintains a data structure similar to Deltoid and provides a fundamental property of bit-level Gaussian distributions: it proves that if there is no large flow, the bit counter values will follow a Gaussian distribution.

Bayes' Theorem. Multi-Resolution Array of Counters (MRAC) [163] first uses counting algorithms and Bayesian estimation for accurate estimation of the flow size distribution. It increases the underlying counter for each incoming packet in a lossy data structure and infers the most likely flow size distribution from the observed counters after collisions. Bayesian statistics are used to recover as much information from the streaming results as possible. SketchLearn is the first approximate measurement approach built on the characterization of the inherent statistical properties of sketches. It abstracts the kth bit of a flow key as a random variable; further, p[k] represents the probability that the kth bit is equal to 1. The multi-level sketch provides a fundamental property,
i.e., if there is no large flow, its counter values follow a Gaussian distribution. Based on this property, SketchLearn extracts large flows from multi-level sketches and leaves the residual counters of small flows to form Gaussian distributions.

Maximum likelihood estimation. The MLM sketch [109] abstracts the counter in the storage vector of flow f as a random variable X, which is the sum of Y, i.e., the portion contributed by the packets of flow f, and Z, i.e., the portion contributed by the packets of other flows. Based on the probability distributions of Y and Z, the MLM sketch formulates the probability of the observed value of a counter; the estimated size of flow f can then be obtained by maximizing the likelihood function.

SSS [120] assumes that the data stream obeys a Zipfian distribution, employs historical data to learn the parameters of the distribution using an ML program, and sets thresholds based on the distribution function. Specifically, it leverages the linear regression model in the ML program to learn the threshold used to identify hot items. The intelligent SDN-based Traffic (de)Aggregation and Measurement Paradigm (iSTAMP) [166] employs an intelligent sampling algorithm to select the most informative traffic flows using information collected throughout the measurement process. The main goal is to adaptively track and measure the most rewarding/informative traffic flows which, if measured accurately, can yield the best improvement in the overall measurement utility. For this purpose, multi-armed bandit sequential resource allocation algorithms are used to adaptively sample the most "rewarding" flows in order to improve the flow measurement performance.

Persistent items Identification and Estimation (PIE) [164] uses the information recorded at the observation points during all the measurement periods, formulates the occurrence estimation problem as a maximum likelihood estimation (MLE) problem, and uses the traditional MLE approach for estimation.
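As a minimal, self-contained example of this estimation style, here is a standard linear-counting-type MLE over the number of empty cells. It is the classic occupancy estimator, not PIE's exact likelihood function:

```python
import math

def mle_distinct_items(m, empty_cells):
    """With m cells of which `empty_cells` remain empty after n items were
    hashed uniformly, maximizing the occupancy likelihood gives the classic
    estimate n_hat = -m * ln(empty_cells / m)."""
    if empty_cells <= 0 or empty_cells > m:
        raise ValueError("empty_cells must be in (0, m]")
    return -m * math.log(empty_cells / m)

# sanity check: if a fraction 1/e of cells stays empty, n_hat is close to m
print(round(mle_distinct_items(1000, 1000 / math.e)))  # → 1000
```

PIE's likelihood additionally distinguishes singleton from collided cells, but the mechanics are the same: write down the probability of the observed cell counts and maximize over the unknown quantity.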
It divides the m cells of the Space-Time Bloom Filter (STBF) of measurement period i into empty, singleton, and collided cells, whose numbers are represented by Zi0, Zi1, and ZiC, respectively. By maximizing the likelihood function, the number of distinct items can be estimated.

3) Machine learning techniques:

The statistical properties of traffic change from time to time. Owing to this uncertainty, traditional sketch designs and optimizations are not suitable for all measurement tasks, which leads to a considerable burden in terms of redesigning and tuning the sketch. Recently, several studies have combined machine learning (ML) techniques with the measurement framework to relieve or eliminate the binding between the traffic flow characteristics and the sketch design. By selecting a set of features from the sketch itself, where every feature is a property of the data contributing to the estimation, ML techniques can reformulate the measurement tasks with these features instead of the measured sketch and thus reduce the design complexity drastically.

ML sketch framework. MLsketch [165] presents a generalized ML framework to reduce the dependence of the sketch accuracy on the network traffic characteristics. It continuously retrains ML models using a small number of samples from the same traffic whose information is being stored in the sketch. Thus, the ML models can continue to adapt to any variation in the network traffic characteristics, without requiring one to manually foresee scenarios or design statistical techniques for different scenarios. The ML model can be trained using the learning sketch and the extracted features, and it reflects a mapping from the current traffic distribution to the metrics of interest. This design is applicable to different network environments. Furthermore, designing the features and the ML model usually involves just a selection from the available features and models, which is much simpler than designing a new sketch algorithm.

Learning assistant parameter setting. To accurately evaluate the frequency and rank of the top-k hot items, the thresholds that are used to determine the type of each item (cold item, potential hot item, or hot item) should be dynamically adjusted to adapt to the flow distribution.

Learning cardinality estimation. To adapt to changes in the flow size distribution, Cohen and Nezri [169] proposed a sampling-based adaptive cardinality estimation framework based on online ML. The traffic flow stream is divided into batches of packets. Each batch is sampled, the selected features and statistical properties are extracted from these samples, and the exact cardinality of each batch is calculated. Then, these values are used to train online ML models. Moreover, they analyzed various possible features, parameters, and online ML algorithms for their framework and proposed the most suitable combination.

Learning frequency estimation. Existing learning-based frequency estimation involves two strategies. On the one hand, [81] and [82] first learn a classifier to separately record the heavy and mouse flows in order to reduce the mutual interference collisions. On the other hand, [79] and [80] leverage the concept of locality sensitivity to cluster similar items together to reduce the approximation error; by averaging the recorded frequencies, the approximate result can be derived.

Learning membership testing. Although the BF has been well explored and evaluated in both academia and industry, Kraska et al. [154] suggested that ML models such as neural networks can be combined with a BF to further improve space utilization and query efficiency. They proposed a method in which ML is used to model a pre-filter stage for a backup BF. Mitzenmacher [155] further clarified a formal model of this structure and stated which guarantees can and cannot be associated with such a structure. The sandwiched LBF [170] optimizes the LBF by surrounding the learning model with two BF layers: the backup BF removes the false negatives from the learning model, while the initial BF removes the false positives upfront. The Partitioned Learned Bloom Filter (PLBF) [171] was developed to overcome the inability of previous methods to exploit the learning model fully. It partitions the score space into multiple regions using multiple thresholds and employs an individual backup BF for each region. Experiments have demonstrated that PLBF can achieve near-optimal performance in many cases. Based on the Stable Bloom Filter (SBF) [172], the Stable Learned Bloom Filter (SLBF) [173] replaces the BFs in the sandwiched LBF [170] and PLBF [171] with SBFs to facilitate data streaming applications.

TABLE V: General methods
Method              | Frequency | HH | Heavy change | Entropy | Super spreader/DDoS | Flow size distribution | Cardinality
OpenSketch [167]    | ×         | ✓  | ✓            | ×       | ✓                   | ✓                      | ×
UnivMon [140]       | ×         | ✓  | ✓            | ✓       | ✓                   | ×                      | ×
SketchVisor [168]   | ×         | ✓  | ✓            | ✓       | ✓                   | ✓                      | ✓
SketchLearn [22]    | ✓         | ✓  | ✓            | ✓       | ✓                   | ✓                      | ✓
HeavyGuardian [119] | ✓         | ✓  | ✓            | ✓       | ×                   | ✓                      | ×
Elastic Sketch [158]| ✓         | ✓  | ✓            | ✓       | ×                   | ✓                      | ✓
MLsketch [165]      | ✓         | ✓  | ×            | ×       | ×                   | ×                      | ✓
HyperSight [150]    | ×         | ✓  | ✓            | ×       | ×                   | ×                      | ×

TABLE VI: Persistence detection

Method            | Enable sliding windows | Store the whole ID | Multi-structure | False positive | False negative
PIE [164]         | ✓                      | ×                  | ✓               | ×              | ✓
LDF [174]         | ✓                      | ×                  | ✓               | ✓              | ✓
Lahiri et al. [175]| ✓                     | ✓                  | ×               | ✓              | ✓
Huang et al. [176]| ✓                      | ×                  | ✓               | ✓              | ×
LTC [177]         | ✓                      | ✓                  | ×               | ✓              | ✓

All the above-mentioned learning sketches face a common problem in terms of updating the out-of-date classifier.

C. Summary and insights

In this section, we reviewed studies related to optimization in the post-processing stage. The linearity of CM-based sketches provides an intrinsic basis for sketch compression and merging; [157] and [160] proposed feasible solutions for other kinds of sketches, such as cardinality registers and top-k summaries. Furthermore, we surveyed techniques for information extraction, such as flow-key-reversible, probability-theory-based, and ML techniques. The information extraction techniques are directly tied to the sketch structure and update operation, and further extraction techniques will be proposed alongside novel sketch designs.

VI. APPLICATION AND IMPLEMENTATION

Persistence is typical of many stealthy types of traffic on the Internet. For example, Xiao et al. [19] observed that legitimate users connect to a server intermittently, whereas attacking hosts keep sending packets to the server. By identifying persistent items, operators can detect potentially malicious traffic behavior. Table VI compares the existing persistence detection methods in terms of their data structure characteristics, functionality, and measurement accuracy.

Multi-layer design. Long-Duration Flow (LDF) detection [174] maintains two counting Bloom filters (B1 and B2) and one small hash table simultaneously. In the starting time interval, LDF records in B1 all the flows that appear during the interval. In the next time interval, LDF adds into B2 the new flows that appear in both B1 and the current interval. Then, B1 and B2 iterate by switching roles in the subsequent intervals. When a flow is identified as a long-duration flow, it is added to the hash table. Lahiri et al.
[175] presented the first small-space approximation algorithm for identifying persistent items. A hash-based filter first processes the items in each measurement period to record the flow IDs and to track their persistence in future periods. Furthermore, they considered the sliding model, under which only items belonging to the W most recent windows are considered.

We have surveyed the related studies on sketch design from the perspective of data flow. In the following, we survey the related studies from another perspective, i.e., the application and implementation of sketches. In total, we summarize 13 measurement tasks in terms of the time and volume dimensions. Furthermore, we survey the software and hardware implementations used when deploying sketches in practice.

A. Measurement tasks

Although the methods listed in Table V are designed to support a comprehensive range of measurement tasks, there are some measurement tasks that they cannot support. To this end, many studies have designed sketches for one specific or several measurement tasks. In this subsection, we review the related measurement tasks in terms of the time, volume, and time-volume dimensions.

PIE [164] uses Raptor codes to encode the ID of each item in a traffic stream. A persistent item can be retrieved and decoded at the measurement point if sufficient encoded bits of its ID are available over sufficiently many measurement periods. Specifically, PIE uses the STBF to record information. During each epoch, PIE maintains a new STBF and transfers it to the permanent storage region for subsequent analysis. After T epochs, PIE uses MLE to estimate the number of occurrences of any given item. However, PIE must maintain T STBFs to track persistent items, at a high space overhead.
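The underlying question these schemes approximate, namely in how many of the t measurement periods an item appears, has a trivial exact baseline, shown below for illustration (names and structure are ours); sketches like PIE answer the same question in sublinear space:

```python
from collections import Counter

def k_persistent_items(epochs, k):
    """Exact baseline for k-out-of-t persistence: an item is persistent if it
    appears in at least k of the t epochs (duplicates within an epoch ignored)."""
    appearances = Counter(item for epoch in epochs for item in set(epoch))
    return {item for item, n in appearances.items() if n >= k}

# 'attacker' shows up in every epoch; 'user' only intermittently
epochs = [["attacker", "user"], ["attacker"], ["attacker", "other", "other"]]
print(k_persistent_items(epochs, k=3))  # → {'attacker'}
```

The baseline stores every distinct ID per epoch, which is exactly the cost that the per-epoch filters and encoded IDs of the surveyed designs avoid.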
On-off sketch [178] introduces a state flag (on/off) into each counter of the sketch, instead of maintaining a Bloom filter to record the existence of elements. At the beginning of each window, the flag is set to on; once an element is hashed into the counter, the flag is turned off, and a counter is only increased when its flag is on. In this way, each counter is increased at most once per time window.

1) Persistence detection: In contrast to a frequent item, a persistent item does not necessarily occur more frequently than other items over a short period. Instead, it persists and occurs more frequently than other items only over a long period.

TABLE VII: Latency detection

Method         | Time stamp | Probe packet | Per-packet delay | Deal with loss and reordering | Abnormal delay detection
LDA [124]      | ×          | ×            | ×                | ×                             | ×
Colate [180]   | ×          | ×            | ✓                | ×                             | ✓
OPA [181]      | ×          | ×            | ✓                | ✓                             | ✓
Marple [179]   | ✓          | ×            | ✓                | ✓                             | ✓
FineComb [182] | ×          | ×            | ×                | partially                     | ×
RLI [183]      | ×          | ✓            | ×                | ✓                             | ×

k out of t persistence. Although we can separate attackers from legitimate users by finding the flows that appear during all measurement periods, Huang et al. [176] considered the situation in which attackers might forgo sending packets during several periods. To solve this problem, they formulated a new problem called k-persistent spread estimation, which measures the persistent traffic elements in each flow that appear during at least k out of t measurement periods. Yang et al. [177] realized that for some applications, people care about essential items that are both persistent and frequent; they formulated this problem and proposed LTC to solve it, including the Long-tail Replacement technique and a modified CLOCK algorithm.

In LDA [124], both the sender and the receiver maintain several counter vectors composed of counter pairs: a timestamp counter for accumulating packet timestamps, and a packet counter for counting the number of arriving/departing packets. For each packet, LDA first maps the packet into a counter vector with a sampling probability. In the long run, each sampled packet is mapped into a counter pair, the timestamp of the packet is added to the timestamp counter, and the packet counter is incremented by one. LDA selects all the counter pairs with the same packet counter on both the sender and the receiver side by checking the packet counters. Then, LDA can easily calculate the total number of successfully delivered packets and the sum of their timestamps. However, in LDA, packets belonging to one segment may be misidentified with other segments due to packet reordering; thus, these groups of packets cannot be used for delay measurement.
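LDA's aggregation step can be illustrated as follows. We assume each side exposes its buckets as (timestamp_sum, packet_count) pairs and skip buckets whose packet counts disagree, as affected by loss or reordering; this is a toy rendition under our own naming, not the paper's implementation:

```python
def lda_average_delay(sender_pairs, receiver_pairs):
    """Estimate average one-way delay from matched LDA counter pairs.
    Each pair is (timestamp_sum, packet_count); buckets with mismatched
    packet counts are discarded as affected by loss or reordering."""
    delay_sum, delivered = 0.0, 0
    for (s_ts, s_cnt), (r_ts, r_cnt) in zip(sender_pairs, receiver_pairs):
        if s_cnt == r_cnt and s_cnt > 0:   # usable bucket: counts agree
            delay_sum += r_ts - s_ts       # total delay of these packets
            delivered += s_cnt
    return delay_sum / delivered if delivered else None

# two packets delayed 1.0s each in bucket 0; bucket 1 lost a packet -> skipped
sender = [(10.0, 2), (5.0, 1)]
receiver = [(12.0, 2), (0.0, 0)]
print(lda_average_delay(sender, receiver))  # → 1.0
```

Note that only sums and counts cross the network, which is the source of LDA's bandwidth savings over per-packet timestamping.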
one segment may be misidentified with other segments due 2) Flow latency: Delay is an essential metric for under- to packet reordering. Thus, these groups of packets cannot be standing and improving system performance. When managing used for delay measurement. networks with stringent latency demands, operators must mea- FineComb [182] proposes a data structure called stash, sure the latency between two observation points, such as a port which maintains the information for packets near the boundary in a middlebox and a network card in the end host. Moreover, of the segments at the receiver to solve this problem. However, as reported in [179], both fine-grained and coarse-grained for groups with lost packets, FineComb cannot calculate their delay measurement are of great importance in performance per-packet delay. Order Preserving Aggregator (OPA) [181] measurement, system diagnosis, traffic engineering, etc. leverages the fact that lost and reordering packets are usu- For per-flow latency measurement. MAPLE [179] attaches ally much fewer than legitimate packets to facilitate efficient a timestamp to every packet. Specifically, when measuring ordering as well as loss of information representation and the packet passing through observation point S to R, S recovery. It proposes a two-layer design to convey ordering and attaches a timestamp to the packets, and R calculates the timestamps and efficiently derive the per-packet delay. How- packet’s latency from S to R by subtracting the timestamp ever, compared with LDA [124] and FineComb [182], OPA from its current time. To reduce the space required to store improves the accuracy of delay measurement with additional all the packets’ latency values, MAPLE predefines a set of overhead for a high data link. latency values and maps each packet to its closest value. 
As illustrated in Table VII, we borrow the idea from OPA For each latency value, MAPLE augments a Bloom filter to [181] and compare flow latency measurement methods in test whether a given packet has been mapped to the latency terms of their functionality, performance, and the basic idea. value. Reference Latency Interpolation (RLI) [183] uses packet 3) Per-flow measurement: Providing per-flow statistics probing. Specifically, S inserts probe packets with a timestamp plays a fundamental role in network measurement. As a into the flow, and R calculates the latency of each probe probabilistic data structure, sketches have been extensively packet similarly to MAPLE. Suppose that the latency of two investigated for per-flow measurement, including flow size (the probe packets has been calculated as l1 and l2; to calculate number of packets in a flow) and flow volume (the number of the latency of the regular packets between two probe packets, bytes in a flow). R uses the straight-line equation to calculate the two probe Table VIII summarizes related studies on per-flow mea- packets’ latency. Further, Colate [180] is the first per-flow surement. Apart from the CM sketch [27] and its variants, latency measurement scheme that requires no probe packets or most research focuses on the skewness characteristic of the timestamp. It records the time information of packets at each traffic stream. Many methods have been proposed to solve observation point and purposely allows noise to be introduced this unbalance situation in terms of the traffic size/volume. in the recorded information to minimize storage. We have discussed the related work in detail in Section IV. For aggregate latency. The aggregate latency refers to 4) Cardinality estimation: Estimating the number of dis- the average and deviation of the latencies experienced by tinct flows (or cardinality) is an essential issue in many all the packets that pass through two observation points. 
In applications, such as measurement and anomaly detection. TABLE VIII: Per-flow measurement

Technique                      | Methods
CountMin [27] and its variants | Conservative update [108], CSM sketch [109], Count MeanMin sketch [184], Count sketch [83]
Variable counter size          | SBF [98], DCF [141], TinyTable [101], CQF [102]
Enlarge count range            | SAC [85], Cuckoo counter [107], SA [84], Counter Braids [24]
Multi-level counters           | Brick [99], RCC [106], Pyramid sketch [94], CounterTree [93], OM sketch [95], ABC [104], Cold filter [64], Diamond sketch [96]
Multi-layer sketch             | Bloom Sketch [97], ASketch [63]
Frequency-aware updating       | FCM [122]

(All of the listed methods target per-flow measurement and adaptivity to skewed traffic.)
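To make the CM-sketch row of Table VIII concrete, the following minimal Python sketch implements Count-Min with the conservative-update optimization [108], which only raises the counters currently holding the minimum value. The width/depth parameters and SHA-1-based hashing are our own illustrative choices, not those of the cited papers.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch; conservative=True enables the
    conservative-update variant, which reduces overestimation."""

    def __init__(self, width=1024, depth=4, conservative=False):
        self.width, self.depth = width, depth
        self.conservative = conservative
        self.rows = [[0] * width for _ in range(depth)]

    def _cols(self, key):
        # One derived hash per row (illustrative; real deployments use
        # cheaper pairwise-independent hash functions).
        for i in range(self.depth):
            h = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def update(self, key, count=1):
        cols = list(self._cols(key))
        if self.conservative:
            low = min(self.rows[i][c] for i, c in enumerate(cols))
            for i, c in enumerate(cols):
                # raise only counters at (or below) the new minimum
                self.rows[i][c] = max(self.rows[i][c], low + count)
        else:
            for i, c in enumerate(cols):
                self.rows[i][c] += count

    def query(self, key):
        # Count-Min never underestimates: the minimum over rows is an
        # upper bound on the true frequency.
        return min(self.rows[i][c] for i, c in enumerate(self._cols(key)))
```

With few inserted keys there are essentially no collisions, so the estimate matches the exact count.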

TABLE IX: Cardinality estimation [76]

Methods               | Std. Err. (δ) | Memory units
MinCount [185]        | 1.00/√m       | 32-bit keys
PCSA [73]             | 0.78/√m       | 32-bit registers
LogLog [74]           | 1.30/√m       | 5-bit registers
HLL [75]              | 1.04/√m       | 5-bit registers
Sliding HLL [77]      | 1.04/√m       | 5-bit registers
HLL-TailCut+ [76]     | 1.00/√m       | 3-bit registers
Refined LL [186]      | 1.00/√m       | 5-bit registers
Cohen and Nezti [169] | —             | —
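The leftmost-1-bit idea behind LogLog and HLL in Table IX can be sketched as follows. This is a toy HyperLogLog of our own (no small-range correction, SHA-1 hashing as an illustrative choice); it uses the harmonic mean of 2^(-M_j) across registers, as discussed below for HLL [75].

```python
import hashlib

def _hash64(x):
    # 64-bit hash derived from SHA-1 (illustrative choice).
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")

def hll_estimate(items, b=10):
    """Toy HyperLogLog: b index bits select one of m = 2^b registers;
    each register keeps the maximum position R of the leftmost 1-bit
    seen in the remaining hash bits."""
    m = 1 << b
    M = [0] * m
    for x in items:
        h = _hash64(x)
        j = h >> (64 - b)                     # first b bits pick a register
        rest = h & ((1 << (64 - b)) - 1)
        rho = (64 - b) - rest.bit_length() + 1  # leftmost 1-bit position
        M[j] = max(M[j], rho)
    alpha = 0.7213 / (1 + 1.079 / m)          # bias-correction constant
    # Harmonic mean tames outlier registers with abnormally large R.
    return alpha * m * m / sum(2.0 ** -r for r in M)
```

Because the registers depend only on the set of distinct items, repeating items does not change the estimate.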

As shown in Table IX, we borrow insights from the well-known cardinality estimation comparison in [76], where m is the total number of memory units. Linear Counting [92] maps all flows uniformly into a bit array, where each element is mapped to one bit. However, it cannot work efficiently under the strict condition that the memory is roughly linear in the cardinality [72]. PCSA [73] uses stochastic averaging, which allocates multiple registers to produce independent estimations and returns their average. However, to achieve high accuracy, m should be large, and m log n bits exceed the limited memory of the monitors [76]. This limitation of PCSA motivated a more efficient algorithm called LogLog [74]. Both LogLog [74] and HLL [75] are based on the leading-zero pattern "0...01" in the binary representation of the hashed values; more specifically, after hashing all the packets, the position of the leftmost 1-bit is denoted by R. However, they adopt different methods for extracting the estimation from the set of registers: LogLog uses geometric averaging, whereas HLL uses the harmonic mean to mitigate the impact of outliers with abnormally large estimations. Sliding HLL [77] improves HLL for stream processing by adding a sliding window mechanism. HLL-TailCut+ [76] further reduces the memory requirement of HLL by 45% on the basis of an extended tail-cutting technique that compresses the information across all registers while reducing the variance among the registers. Its basic idea is that the memory per register may be compressed into four or even three bits with no significant loss of useful information. HLL-TailCut+ can thus deploy more registers and obtain a more accurate estimation result given the same space overhead.

Refined LogLog [186] uses a fine-grained common ratio, instead of the ratio 2 used in FM sketch [73] and LogLog [74], to narrow the gaps between two recordable values. Using geometric hash functions with a smaller common ratio, Refined LogLog overcomes the restriction that the former methods can only record powers of two (2^k), which results in a large gap from the actual number. Moreover, in real-world networks, we do not know the cardinality in advance, which would enable us to choose an appropriate bitmap size. A self-adaptive version of LogLog was therefore proposed to adapt to the cardinality by changing its common ratio dynamically. It achieves high accuracy when the cardinality is small and can also store large cardinalities.

Cohen and Nezti [169] proposed a sampling-based adaptive cardinality estimation. The traffic stream is divided into batches of packets. Each batch is sampled, and selected features are extracted from the samples. The features are then passed to the predict() operation of an online ML algorithm, which returns an estimation of the batch's cardinality. Subsequently, the entire batch is sent for training. In the training phase, partial_fit() is provided with the batch's set of features and its cardinality. The exact cardinality of a batch can be calculated using a simple hash table.

5) Super spreader: Spreader estimation aims to identify a source IP that communicates with more than a threshold number of distinct destination IP/port pairs. In this case, the flow is identified by the source address, and the elements under measurement are the destination addresses in the headers of the packets. Conversely, an increase in the cardinality of a specific flow may signal a DDoS attack against the flow's destination address. A DDoS attack is a malicious attempt to disrupt the regular traffic of a targeted server, service, or network by overwhelming the target or its surrounding infrastructure with a flood of traffic. In that case, a flow is identified by the destination address, and we must count the number of distinct source addresses in each flow.

TABLE X: Super Spreader/DDoS

Methods                       | Basic idea
One/two-level filtering [187] | Set a sampling probability to filter
OSM [188]                     | Share bitmaps
MultiResBitmap [71]           | One bitmap per source
CSE [90]                      | Bit sharing
Virtual Register Sharing [89] | Multi-bit sharing
RCD sketch [135]              | Remainder characteristics of number theory
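A minimal per-source bitmap spread estimator in the spirit of MultiResBitmap's "one bitmap per source" idea from Table X (single resolution only; the bitmap size, hashing, and threshold below are our own illustrative assumptions). Each destination sets one bit in its source's bitmap, and the linear-counting formula n ≈ −b·ln(z/b) converts the number of zero bits z into a distinct-destination estimate.

```python
import hashlib
from math import log

class SpreadEstimator:
    """One bitmap per source; spread = estimated distinct destinations."""

    def __init__(self, bits=1024):
        self.bits = bits
        self.maps = {}                        # source -> bitmap (as int)

    def add(self, src, dst):
        h = int(hashlib.sha1(f"{src}|{dst}".encode()).hexdigest(), 16)
        self.maps[src] = self.maps.get(src, 0) | (1 << (h % self.bits))

    def spread(self, src):
        ones = bin(self.maps.get(src, 0)).count("1")
        zeros = self.bits - ones
        if zeros == 0:
            return float("inf")               # bitmap saturated
        return -self.bits * log(zeros / self.bits)

def super_spreaders(est, threshold):
    # report sources whose estimated fan-out exceeds the threshold
    return [s for s in est.maps if est.spread(s) >= threshold]
```

As the text notes for [71], keeping a full bitmap per source is accurate but space-hungry; the bit-sharing schemes (OSM, CSE, Virtual Register Sharing) exist precisely to amortize this cost.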

Table X compares these studies in terms of their basic ideas. One-level filtering [187] samples distinct source–destination pairs from the packets such that each distinct pair is included in the sample with probability p = 1/k, in order to identify the k-superspreaders. Two-level filtering [187] maintains two levels of filters: the first level filters out source IPs that contact only a small number of distinct destination IPs, using a sampling probability smaller than that of the second filter. Compared with one-level filtering, two-level filtering is more space-efficient: both levels can capture the real super spreaders, and the source IPs that contact only a few destination IPs are filtered out efficiently at the first level. By embedding estimated fan-outs into binary hash strings, multiple Spread Sketches [189] can be merged to provide a network-wide measurement view for recovering super spreaders and their estimated fan-outs. MultiResBitmap [71] uses bitmaps to save space instead of storing the actual source/destination addresses in each sample. It allocates a bitmap per source, in which one bit is set for each destination that the source contacts. However, such a spread estimator cannot fit into a tight space where only a few bits are available for each source [89]. OSM [188] allocates each source randomly to l columns of a bit matrix (where the columns are bitmaps) through l hash functions; a column may therefore be shared by multiple sources. To reduce the introduced noise, OSM proposes a method to remove the noise and estimate the spread of each source. Nevertheless, on the one hand, OSM makes l memory accesses and uses l bits to store one contact; on the other hand, the noise may be too large to be removed in a compact memory space when a significant fraction of all the bits are set. CSE [90] instead creates a virtual bit vector for each source by taking bits uniformly at random from a common pool of available bits. This sharing strategy provides an excellent property: the probability that one source's contacts cause noise in any other source is the same. Compared with the noise in [188], such uniform noise is easy to measure and remove. Virtual Register Sharing [89] develops a framework of virtual estimators that allows sharing among cardinality estimation solutions, including PCSA [73], LogLog [74], and HyperLogLog [75]. Moreover, it shows that sharing at the multi-bit register level is superior to sharing at the bit level. RCD sketch [135] locates hosts with a large connection degree by estimating the in-degree/out-degree associated with a given host through the remainder characteristics of number theory.

6) Heavy Hitters: Heavy hitters are the flows that exceed a specified share of the total traffic volume. In many applications, such as congestion and anomaly detection, identifying heavy-hitter flows is crucial [190].

The CM sketch and Count-Min heap [27] can detect heavy hitters by examining whether a flow's estimated count exceeds the threshold. However, the CM sketch is non-invertible. Lossy Counting (LC) [191] treats every incoming flow as a potential elephant flow. If the flow's counter exists, LC increases the corresponding counter on every arrival; if the counter is not in the table, a counter with value 1 is allocated if space is available. LC keeps the table size bounded by periodically decreasing the table counters and evicting items whose counters reach 0. However, LC requires up to (1/ε) log(εn) table entries. Probabilistic Lossy Counting (PLC) [192] requires fewer table entries on average but provides only a probabilistic guarantee. PLC makes the error bound significantly smaller than the deterministic error bound of Lossy Counting [191]. In Lossy Counting, the error bound reflects the potential error in the estimated frequency of an element owing to possible prior removal(s) of the element from the table. PLC shows that the error of more than 90% of the flows is significantly smaller than the deterministic error bound. By using a probabilistic error bound instead of a deterministic one to remove flows at the end of an epoch, PLC improves the memory consumption of the algorithm. Furthermore, PLC reduces the false positives of Lossy Counting and achieves a low estimation error, albeit slightly higher than that of LC.

Caching Heavy-hitter Flows with Replacement (CHHFR) [190] leverages LRU to track heavy hitters. It not only considers the update time of the traffic flows but also adds another variable, Ctr, to find a replacement victim, with an underlying doubly linked list of flow units recording the frequency of items. MV-sketch [116] is designed to support heavy flow detection with a small and static memory allocation. It tracks candidate heavy flows inside a reversible sketch data structure via the idea of majority voting. LD-sketch [125] combines the traditional counter-based and sketch-based methods to reduce false positives by aggregating multiple detection results. By augmenting an associative array in each bucket, LD-sketch can track the heavy key candidates that are hashed into the bucket. BitCount [162] maintains a set of counters to count the total number of 1-bits in each bit position of the binary representation of the IP addresses. However, owing to the additive property of the bit counters, this method may suffer inevitable false positive errors. IM-SUM [8] introduces an elephant detection scheme that focuses on elephants in terms of volume. It achieves constant update time, i.e., an amortized time of O(1), by periodically removing many small

TABLE XI: Heavy Hitter detection

Methods                  | Space                              | Updating time                    | Query time
CM sketch [27]           | O((H/ε) log(1/δ))                  | O(log(1/δ))                      | O(n)
Count-min heap [27]      | O((1/ε) log(1/δ) + H log n)        | O(log(1/δ))                      | O(H)
CHHFR [190]              | —                                  | —                                | —
MV-Sketch [116]          | O((1/ε) log(1/δ) log n)            | O(log(1/δ))                      | O((1/ε) log(1/δ))
LD-sketch [125]          | O((H/ε) log(1/δ))                  | O(log(1/δ))                      | O((H/ε) log(1/δ))
LC [191]                 | O((1/ε) log(εn))                   | O(1)                             | O((1/ε) log(εn))
PLC [192]                | O((1/ε) log(εn))                   | O(1)                             | O((1/ε) log(εn))
BitCount [162]           | O((1/ε)·k)                         | O(k)                             | O(k)
Sequential Hashing [134] | O((H/ε) log n log(1/δ))            | O(log n log(1/δ))                | O((H/ε) log n log(1/δ))
IM-SUM [8]               | O(1/ε)                             | O(1)                             | O(1/ε)
DIM-SUM [8]              | O(1/ε)                             | O(1)                             | O(1/ε)
Fast sketch [69]         | O(H log(1/δ) log(n/(H log(1/δ)))) | O(log(1/δ) log(n/(H log(1/δ)))) | O(H³ log(1/δ) log(n/(H log(1/δ))))
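The Lossy Counting procedure described above can be sketched as follows. This is the classic bucket-based formulation of [191] (parameter names are ours): the stream is split into buckets of width 1/ε, and at each bucket boundary the entries whose count plus error bound no longer exceed the current bucket id are evicted.

```python
from math import ceil

def lossy_counting(stream, epsilon=0.01):
    """Classic Lossy Counting: for any item with true frequency f,
    the kept estimate lies in [f - epsilon*N, f]."""
    width = ceil(1 / epsilon)
    table = {}                                # item -> (count, delta)
    for n, item in enumerate(stream, start=1):
        bucket = ceil(n / width)
        if item in table:
            c, d = table[item]
            table[item] = (c + 1, d)
        else:
            # delta = maximum count the item could have had before insertion
            table[item] = (1, bucket - 1)
        if n % width == 0:                    # bucket boundary: prune
            table = {k: (c, d) for k, (c, d) in table.items()
                     if c + d > bucket}
    return table

def heavy_hitters(table, n, support, epsilon):
    # report items whose estimated count exceeds (support - epsilon) * n
    return {k for k, (c, d) in table.items() if c >= (support - epsilon) * n}
```

The pruning step is what bounds the table at roughly (1/ε) log(εn) entries, matching the LC row of Table XI.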

TABLE XII: Hierarchical Heavy Hitters

Methods                        | Space            | Updating time | Enable mHHH
Separator [195]                | O(H²/ε)          | O(H)          | ×
TCAM-based HHH [196]           | O(2/T)           | O(H)          | ×
Recursive Lattice Search [193] | —                | O(N log H)    | ✓
Randomized HHH [58]            | O(H/ε)           | O(1)          | ✓
Truong et al. [194]            | O(H^(3/2)/ε)     | O(H^(3/2)/ε)  | ✓
Cormode et al. [197]           | O((H/ε) log N)   | O(H log N)    | ✓
Mitzenmacher et al. [198]      | O(H/ε)           | O(H log N)    | ✓
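The prefix hierarchy underlying HHH detection can be illustrated with exact per-level counters. This toy of ours is not a sketch: it merely rolls every flow's count up into all of its prefix ancestors and thresholds each level, whereas real HHH algorithms such as [58] or [195] work in bounded memory and additionally discount the counts of descendant HHHs.

```python
from collections import Counter

def prefix_counts(flows, levels=(8, 16, 24, 32)):
    """flows: iterable of (dotted IPv4 string, packet count).
    Returns per-level Counters keyed by the binary prefix string."""
    counts = {l: Counter() for l in levels}
    for ip, pkts in flows:
        bits = "".join(f"{int(o):08b}" for o in ip.split("."))
        for l in levels:
            counts[l][bits[:l]] += pkts   # roll up into every ancestor
    return counts

def hierarchical_heavy(counts, threshold):
    # naive per-level thresholding (no descendant discounting)
    return {l: {p: c for p, c in ctr.items() if c >= threshold}
            for l, ctr in counts.items()}
```

For example, flows that are individually light can still make a /8 or /24 prefix heavy, which is exactly the aggregation effect the HHH task exploits.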

flows from the table at once, instead of removing the minimum flow upon the arrival of each non-resident flow. DIM-SUM [8], its de-amortized counterpart, performs only a small portion of this maintenance procedure upon each update to achieve a worst-case complexity of O(1).

Table XI compares various heavy hitter detection methods in terms of their space overhead, updating time, and query time. In the table, H represents the maximum number of heavy hitters in a measurement epoch, n is the size of the traffic-flow key domain, ε (0 < ε < 1) denotes the approximation parameter, and δ (0 < δ < 1) is the error probability.

7) Hierarchical Heavy Hitters (HHH): The structure of IP addresses implies a prefix-based hierarchy. Network traffic characteristics can be better represented by aggregating flows according to their 5-tuple fields [193]. This makes flow aggregation a powerful means for traffic measurement and a valuable component of anomaly detection for identifying attacks and scans. Moreover, identifying the pairs of source and destination prefixes (or other attributes) that give rise to a significant amount of the global traffic constitutes a further measurement task, namely multidimensional Hierarchical Heavy Hitters (mHHH) [194].

Separator [195] consists of H stream-summary structures, each corresponding to a level of the domain tree in the HHH detection problem. It combines bottom-up and top-down processing strategies to detect HHHs: an incoming item updates the first-layer counters, or higher-layer counters if there is no space available or no matched counter in the lower layer. Randomized HHH [58] proposes a randomized constant-time algorithm for HHH, which achieves a worst-case update time of O(1) for each packet. Recursive Lattice Search [193] revisits the commonly accepted definition of HHH and applies Z-ordering [199] in a recursive partitioning algorithm. The Z-order makes the ordering consistent with the ancestor–descendant relationship in the hierarchy, and it transforms the HHH problem into a simple space partitioning of a quadtree.

Table XII compares the HHH detection methods according to their space consumption and updating time. Specifically, N represents the total number of packets, H is the size of the hierarchy, and ε denotes the allowed relative estimation error for each single flow's frequency. For TCAM-based HHH [196], T is a threshold: the flows that consume at least a fraction T of the link capacity in each interval are defined as heavy hitters, and its space refers to the number of TCAM matching rules.

8) Top-k detection: Top-k flow detection is a critical task in network traffic measurement, with many applications in congestion control, anomaly detection, and traffic engineering [118], as well as in data mining and security [120]. Table XIII summarizes the related methods and their basic ideas.

Sketch-based methods such as the CM sketch [27] and CSketch [83] record the packet counts of all flows. By augmenting an additional heap, these methods can report the top-k heavy flows. However, they also record the frequencies of mice; such information is useless, detrimental to finding the top-k heavy flows, and requires additional memory [120].

To eliminate small flows and admit suspected large flows, the MG counter [123] performs O(k) operations to update k

TABLE XIII: Top-k heavy detection

Methods Basic idea SS [113] Evict the item with the minimal value FSS [142] Augment a sketch to filter items SSS [120] Insert the item according to scoring CSS [149] Require no pointer and use statistical memory WCSS [149] Point queries in O(1) under sliding model Top-k Frequent [112] Evict the item by decreasing all the items RAP [111] Randomized admission policy CountMax [115] Lightweight and cooperative measurement HeavyKeeper [118] Count with exponential decay HashPipe [137] Use emerging programmable (P4) Discrete Tensor Completion [200] Data plane implementations counters in a hash table; the overhead becomes significant efficient system with incomplete measurement data as a result when there are many mice to be kicked out. Pandey et al. of sub-sampling for scalability or missing data. [201] show that by using modern storage devices and building HashPipe [137] is an algorithm for tracking the k heaviest upon on recent advances in external-memory dictionaries, MG flows with high accuracy within the features and constraints leveled external-memory reporting tables (LERTs) can process of programmable switches. HashPipe is heavily inspired by millions of stream events per second. Space Saving [113] and Space Saving [113]. It maintains both the flow key and the Frequent [112] both treat each incoming packet f as a sus- frequency in a pipeline of hash tables to track heavy flows by pected hot item. They assign a high frequency to f and insert evicting lighter flows from the switch memory over time. it into the top-k data structure. Further, they gradually evict the 9) Entropy estimation: As a critical metrics, entropy has at- mice from the top-k data structure with a certain probability tracted considerable attention for network measurement [132]. as time passes. 
However, in real traffic traces, most items are Using the entropy of traffic distributions, we can perform mice, and processing these mice involves additional overhead a wide variety of network applications such as anomaly and makes the ranking and frequency estimation of the top- detection, clustering to reveal interesting patterns, and traffic k flows inaccurate. FSS citeDBLP:journals/isci/HomemC10 classification [202]. IMP [132] is the first method to measure uses a filtering approach to improve Space Saving. It uses the entropy of the traffic between every origin-destination pair a bitmap to filter and minimize updates on the top-k data (OD-pair flows). Moreover, IMP can measure the entropy structure and estimate the error associated with each element. of the intersection of two data streams, A and B, from SSS [120] adds a small queue and a Scoreboard to insert the sketches of A and B, which are recorded from two only the elephants into Stream Summary instead of all the measurement points. Lall et al. [202] contributed to the inves- items. The scoreboard is responsible for recording potential tigation and application of streaming algorithms to compute hot items to indicate whether an item is an elephant item. the entropy over network traffic streams. They also identified CSS [149] redesigns Space Saving and achieves a reduction the appropriate estimator functions for calculating the entropy in space overhead of up to 85%; it works at O(1) with accurately and provided proof of approximation guarantees high probability. Moreover, WCSS is the first algorithm that and resource usage. AMS-estimator [203] designs an algorithm supports point queries with constant time complexity under the for approximating the empirical entropy of a stream of m sliding window model. This property makes WCSS a practical values in a single pass, using O −2 log δ−1 log m words choice for matching the line rate. 
of space, which is near-optimal in terms of its dependence on CountMax [115] and MV-sketch [116] adopt the majority , where  is the approximation parameter and δ represents vote algorithm (MJRTY) [117] to track the candidate heavy the error probability. Defeat [129] uses the entropy of the flow in each bucket. Moreover, it has loose bounds on the empirical distribution of each feature to detect unusual traffic estimated values of the top-k flows. Compared with SSS patterns, where the traffic feature represents an entry in a [120], CountMax consumes only 1/3-1/2 computing over- packet header field. head and reduces the average estimation error by 20%–30% 10) Change detection: A sudden increase in network traf- under the same memory size. RAP [111] is a randomized fic owing to the emergence of elephant flows is essential admission policy for allocating counters to non-measurement for network provisioning, management, and security because flows. More specifically, this strategy ignores most of the tail significant patterns often imply events of interest [134]. For flows and can still admit the elephants eventually. Heavy- example, the beginning of a DDoS attack or traffic rerouting Keeper uses the strategy from HeavyGuardian [119], i.e., owing to link failures will cause traffic to change heavily. A count with exponential decay. Specifically, HeavyKeeper heavy changer is defined as a flow that contributes more than a only keeps track of a small number of flows. Mice entering threshold of the total capacity over two consecutive intervals. the data structure will decay away to make room for the To detect the significant changes in the two consecutive true elephants. Discrete Tensor Completion [200] solves the time intervals, Fast Sketch [69], MV-sketch [116], LD-sketch challenging problem of inferring the top-k elephants in an [125], Modular Hashing [139], Deltoid [68], Reversible k- TABLE XIV: Heavy change detection

Methods Memory Updating time Query time False positive False negative n  n  n  Fast sketch [69] O k log k O log k O k log k XX 2 MV-Sketch [116] Θ (k log n) O (k) O k XX n  n  Group testing [205] O k log k O (log n) O k log k X × 1/2 2 Heavy change Defeat [129] O n O (1) O k X × Sequential Hashing [134] Θ (k log n) Θ (log n) Θ (k log n) XX  Θ(1)   1  (log n) log log n Reversible k-ary sketch [204] Θ log log n Θ (log n) O kn log log n XX Deltoid [68] Θ (k log n) Θ (log n) O (k log n) XX 2  LD-Sketch [125] Θ k log n O (k) O (k) X ×  1     1  log log n log n log log n Modular hashing [139] O n log log n O log log n O kn log log n XX ary sketch [204], and Group testing [205] all derive the finding strategies, such as frequent [146], probabilistic decay difference sketch Sd= |S2−S1|, where S1 and S2 are the [119] and probabilistic replacement [111]. The idea behind sketches recorded for the two consecutive time intervals. the Snapshotting is that a burst can be described as a sudden Because of the linearity of sketches, the difference in flows can increase and a sudden decrease. Compared with PBE [206], be estimated by subtraction. For any key whose value recorded BurstSketch care more about real-time burst detection in high- in Sd exceeds the threshold, we can conclude that this flow speed data streams. is a suspected heavy changer and proposed as a set of heavy 12) Persistent sketch: A persistent sketch, also known as a change flows. Furthermore, Deltoid [68] refers to a flow as multi-version data structure in the literature, is a data deltoid if it has a large difference, including absolute, relative, structure that preserves all its previous versions as it is updated or variational, instead of only absolute difference. MV-sketch over time [207]. By maintain such a data structure, network [116] and LD-sketch [125] both record the flow ID to achieve operators can perform historical window queries and historical reversibility. 
Fast Sketch [69], Modular Hashing [139], and queries on the whole history data instead of just a measurement Reversible k-ary sketch [204] are all reversible sketch designs period. Wei et al. [207] introduce Piecewise Linear Approxi- that adopt different strategies to identify the heavy changer. mation into basic sketch structure to support historical window They separately record the volume information of bit or bits. queries and historical queries. They propose persistent CM By identifying the heavy bit/bits changer, these methods can Sketch and persistent AMS Sketch, which leverage piecewise- reverse the heavy change flows. linear functions to approximate each counter of the sketch over Interestingly, HyperSight [150] exploits packet behavior time. Persistent Bloom Filter [209] decomposes the time into changes in response to network incidents achieving both high a binary representation of temporal range and maintains a BF measurement coverage and high accuracy. By applying a for each to support historical window queries and historical series of Bloom filters, it checks whether the corresponding queries. packet behaviors have been changed. Moreover, HyperSight 13) Item batch detection: Chen et al. [210] define item designs a packet behavior query language to express multiple batch as a consecutive sequence of identical items that are measurement queries based on the packet behavior change close in time in a data stream. They argue that item batch is primitive to support a wide range of network event queries. a useful data stream pattern in the cache, burst detection, and Table XIV summarizes existing heavy change detection APT detection. They first propose ClockSketch to delete the methods in terms of their memory overhead, updating time stale information as much as possible while maintaining all complexity, and query time complexity, where n represents the items visited within the latest time window. 
range of flow keys and k is an upper bound on the number of anomalous keys of interest. Moreover, we focus on the false positive and false negative errors involved in these methods.

11) Burst detection: Paul et al. [206] define a burst as an acceleration over the incoming rate of an event mention. Based on the Persistent CM sketch (PCM) [207], they explore the tradeoff between efficiency and approximation quality in the design of data summaries. They further present PBE to find burst events at any time over the entire history. Zhong et al. [208] refer to a sudden increase in arrival rate followed by a sudden decrease as a burst. They propose BurstSketch, which first selects potential burst items efficiently with the Running Track technique, and then monitors these potential burst items and captures the key features of the burst pattern with the Snapshotting technique. Specifically, Running Track efficiently tracks the most frequent elements.

B. Hardware or Software implementation

As the Internet enters the era of big network data, traffic flow may be massive in some essential scenarios. Modern high-speed routers forward packets at the speed of hundreds of Gigabits or even hundreds of Terabits per second [89]. The number of traffic flows that traverse a core router may be on the order of tens of millions. Moreover, the number of bytes per flow can also be large. Simultaneously tracking such a large number of flows is a unique challenge for traffic measurement hardware and software. This subsection introduces methods that focus on the hardware and software implementation of measurement in the passive measurement area.

1) Implementation in SRAM: To maintain high throughput, routers forward packets from incoming ports to outgoing ports via a switching fabric, bypassing the main memory and CPU. If the measurement is implemented as an online module that processes packets in real time, one option is to implement it on network processors at the incoming/outgoing ports and use on-chip cache memory. However, the commonly used cache on processor chips is SRAM, typically with limited hardware resources, which may have to be shared among multiple functions for routing, performance, and security purposes. In such a context, the amount of memory allocated for measurement may be tiny. Several methods aim to leverage SRAM to accomplish measurement tasks. Most existing studies are devoted to compact counter representations that let the counters occupy as little memory as possible so that they fit into the off-chip SRAM. Furthermore, several studies have been conducted to break the off-chip SRAM's throughput bound to guarantee per-packet processing.

Single counter compression. Many studies have adopted a single-counter-based compressed representation approach. SAC [85] and the SA counter [84] both adopt a floating-point-like number representation to enlarge the counting range or compress the counters. SAC statically divides the available memory of q bits in a unit into two parts, A and mode, of size l and k bits, respectively, and represents a counter as A · 2^(r·mode), while the SA counter augments a flag bit to dynamically adjust the mode bits and thereby enlarge the counting range.

Sampling strategy. Several sampling methods have been developed to reduce the processing overhead and improve the estimation accuracy of the counter estimator. ANLS [55] proposes an estimator that dynamically adjusts the sampling rate of a flow depending on the number of packets already counted, which improves the measurement accuracy for small flows. DISCO [16] extends ANLS by regulating the counter value to be a real increasing concave function of the actual flow length, and supports both packet counting and byte counting with high accuracy. Counter Estimation Decoupling for Approximate Rates (CEDAR) [211] decouples and shares pre-given estimation values via pointers. By sampling the packets according to the current and following counters, CEDAR can dynamically update the pointers. ICE-Buckets [57] presents a closed-form representation of the optimal estimation function and introduces an independent bucket-based counter architecture to further improve the accuracy.

Bit-sharing strategy. Several shared-bit compression approaches reduce the wastage of the high-order bits of cold items. ABC [104] borrows space from adjacent counters through operations such as bit borrowing and combination when a counter overflows. Counter Tree [93] shares the high-order bits of large virtual counters, which are constructed from multiple physical counters organized in a tree structure and span a path crossing multiple levels of the tree. OM sketch [95] shares the high-order bits through a hierarchical CSM sketch [109]; it achieves close to one memory access and one hash computation per insertion or query while maintaining high accuracy. Pyramid Sketch [94], which also employs a tree structure, dynamically assigns an appropriate number of bits to items of different frequencies.

Counter-sharing strategy. CSM sketch [109] and Counter Braids [24] are representative counter-sharing compression approaches. CSM sketch splits each flow's size among several counters that are randomly selected from a counter pool: each update randomly selects one counter from the flow's counter vector and increases that counter by one. Counter Braids establishes a hierarchical counter architecture by braiding the counter bits with random graphs to solve the two central problems, counter space and the flow-to-counter rule. These approaches achieve a reasonable compression rate; however, the decompression process is relatively slow.

Virtual estimators. Virtual estimators that facilitate memory sharing have also been proposed for recent cardinality estimation solutions. Virtual HyperLogLog [91] shares a register array C of m HLL registers to store the packet information of all flows. For an arbitrary flow f, where f is the flow label, it pseudo-randomly selects s registers from C to logically form a virtual counter C_f, and uses C_f to encode the packets of flow f. Virtual Register Sharing [89] allocates each flow a virtual estimator, such as PCSA [73], LogLog [74], or HyperLogLog [75], and these virtual estimators share a common memory space. It demonstrates that sharing at the multi-bit register level is superior to sharing at the bit level.

Apart from the above-mentioned methods, other strategies have been developed to overcome the throughput bound by exploiting the speed of SRAM. The Cache-assisted Stretchable Estimator (CASE) [212] uses the on-chip memory as a fast cache for the off-chip SRAM. It counts hot items in the on-chip memory and writes them back to the off-chip memory according to the cache's estimation. Because of the heavy-tailed distribution of traffic, most counter accesses occur in the cache; the throughput thus increases remarkably, while the accuracy improves because the counting is not compressed.

2) Implementation in DRAM: The prevailing view is that SRAM is too expensive for implementing large counter arrays, whereas DRAM is too slow to provide line-speed updates. This view is the main premise of a number of hybrid SRAM/DRAM architectural proposals [213] [214] [215] [216] that still require substantial amounts of SRAM for large arrays. However, some studies [217] [60] [218] [219] have presented the contrarian view that modern commodity DRAM architectures, driven by aggressive performance roadmaps for consumer applications, have advanced architectural features that can be exploited to make DRAM solutions practical.

Lin and Xu [217] first highlighted the potential of leveraging DRAM for per-flow network measurement. They proposed two schemes, one based on the replication of counters across memory banks and the other based on the randomized distribution of counters across memory banks, to maintain wire-speed updates to large counter arrays. Compared with existing hybrid SRAM/DRAM counter architectures, this strategy achieves the same counter update rates without needing a nontrivial amount of SRAM for storing partial counts. They further proposed a randomized DRAM architecture [60] [218] that harnesses the performance of modern commodity DRAM architectures by interleaving counter updates across multiple memory banks. The architecture consists of a simple randomization scheme, a small cache, and small request queues that statistically guarantee near-perfect load balancing of counters across the DRAM banks.

3) Implementation in hybrid SRAM-DRAM: Statistics counters are the fundamental unit of sketch-based traffic measurement, and maintaining these counters at high line speed is a challenging research problem. Briefly, two requirements must be met to solve the per-flow measurement problem [85]: (1) storing a large number of counters and (2) updating a large number of counters per second. Although cheap and large DRAM can easily satisfy the first requirement, DRAM access times of 50-100 ns cannot accommodate frequent counter updates. Meanwhile, expensive SRAM with access times of 2-6 ns allows many more counter increments per second but is not economical for a large number of counters.

A hybrid SRAM/DRAM structure can help store the massive number of counters. Several methods use DRAM to maintain the statistics counters together with a small, fixed amount of SRAM. DRAM maintains N counters of M bits, while SRAM stores N counters of m bits, where m < M; a write-back buffer between the SRAM counters and DRAM counters temporarily holds the indices of SRAM counters that have overflowed and must later be flushed to DRAM. A simple counter management algorithm (CMA) also guarantees that SRAM counters do not overflow in bursts that are sufficiently large to fill the write-back buffer, even in the worst case. This approach consumes significantly fewer bits in SRAM and uses a simpler CMA scheme compared with the heap in LCF [213] and the tree-like structure in LR(T) [214].

Roeder et al. [215] extended LCF [213] and LR(T) [214] with a multilevel counter memory architecture to reduce the amount of fast memory required for nonuniform traffic patterns. They observed that many counters in previous methods never come close to overflow; hence, using several SRAM levels to store the lower bits of every counter may be a better strategy. Whenever a counter overflows, i.e., exceeds its maximum value, it updates the corresponding counter at the next level. Comprehensive experiments have demonstrated that the multilevel counter memory architecture can reduce the required amount of fast memory by as much as 28%.

Lall et al. [220] propose a hybrid SRAM/DRAM algorithm for measuring all elephant and medium-sized flows with strong accuracy guarantees. They employed a spectral Bloom filter
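To make the A · 2^(r·mode) representation concrete, here is a minimal small-active-counter in Python, a simplification in the spirit of SAC (the field widths l and k and the scale r are illustrative, not the paper's tuned choices): increments are applied with probability 2^(−r·mode) so that the estimate stays correct in expectation, and when the l-bit part A overflows, it is renormalized and the exponent part grows.

```python
import random

class SmallActiveCounter:
    """Illustrative small-active-counter in the spirit of SAC [85]: a value
    is stored in q = l + k bits as A * 2**(r * mode), where A is an l-bit
    estimation part and mode is a k-bit exponent part (r is a global
    scaling parameter)."""

    def __init__(self, l=8, k=4, r=1):
        self.l, self.k, self.r = l, k, r
        self.A = 0     # l-bit estimation part
        self.mode = 0  # k-bit exponent part

    def increment(self):
        # Increment A with probability 2^-(r*mode) so the expected value
        # of the estimate grows by exactly 1 per call.
        if random.random() < 2.0 ** (-self.r * self.mode):
            self.A += 1
        if self.A == 1 << self.l:   # A overflowed its l bits:
            self.A >>= self.r       # renormalize the estimation part
            self.mode += 1          # and grow the exponent

    def estimate(self):
        return self.A * 2 ** (self.r * self.mode)
```

While the true count stays below 2^l the counter is exact; beyond that, the probabilistic increments trade a small relative error for an exponentially larger counting range in the same q bits.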
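The adaptive-sampling principle can be sketched in a few lines. The probability function p(c) = (1+a)^(−c) and the closed-form inversion below follow the general ANLS recipe, but the parameter a and the exact functions are illustrative rather than the paper's tuned design:

```python
import random

class AdaptiveSampler:
    """Adaptive non-linear sampling in the spirit of ANLS [55] (illustrative
    sketch): a packet is counted with probability p(c) = (1 + a)**(-c),
    which decays as the counter c grows, so small flows are counted almost
    exactly while large flows are aggressively sampled."""

    def __init__(self, a=0.01):
        self.a = a   # growth parameter; larger a = more aggressive sampling
        self.c = 0   # the sampled counter

    def offer(self):
        # Count the arriving packet with probability p(c) = (1+a)^(-c).
        if random.random() < (1 + self.a) ** (-self.c):
            self.c += 1

    def estimate(self):
        # Closed-form inversion of the sampling process:
        # n_hat = ((1+a)^c - 1) / a.
        return ((1 + self.a) ** self.c - 1) / self.a
```

Note that p(0) = 1, so the first packets of every flow are always counted; the favorable accuracy for small flows follows by construction.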
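The cache-assisted idea behind CASE can be illustrated with a toy hot-counter cache (an LRU policy is our own simplification; CASE's actual replacement and estimation logic differ): hot items are counted in a small fast structure, and a count is written back to the large slow counter array only upon eviction.

```python
from collections import OrderedDict

class CachedCounters:
    """Illustrative hot-item counter cache in the spirit of CASE [212]:
    counters of hot items live in a small fast cache; on eviction the
    cached count is written back to the large slow counter array.
    The LRU replacement policy here is a simplification."""

    def __init__(self, n_flows, cache_size=256):
        self.off_chip = [0] * n_flows   # large, slow counter array
        self.cache = OrderedDict()      # small, fast cache (LRU order)
        self.cache_size = cache_size

    def increment(self, i):
        if i in self.cache:
            self.cache[i] += 1
            self.cache.move_to_end(i)   # keep hot items resident
        else:
            if len(self.cache) >= self.cache_size:
                j, c = self.cache.popitem(last=False)  # evict coldest item
                self.off_chip[j] += c   # write back the evicted count
            self.cache[i] = 1

    def read(self, i):
        return self.off_chip[i] + self.cache.get(i, 0)
```

Because traffic is heavy-tailed, the vast majority of increments hit resident hot counters, so the slow memory is touched only on the comparatively rare evictions.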
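The common core of these hybrid SRAM/DRAM counter proposals can be sketched as follows (an illustrative Python model, not any single paper's design): the fast memory keeps the m low-order bits of each counter, a small write-back buffer records the indices of overflowed fast counters, and available slow-memory cycles drain the buffer into the full-width DRAM counters.

```python
from collections import deque

class HybridCounters:
    """Illustrative hybrid SRAM/DRAM statistics counters: SRAM keeps the
    m low-order bits of each counter; on overflow the counter index is
    queued in a small write-back buffer, and the overflow amount is later
    flushed to the full-width DRAM counter."""

    def __init__(self, n, m=4):
        self.m = m
        self.sram = [0] * n    # fast, small m-bit counters
        self.dram = [0] * n    # slow, full-width counters
        self.buffer = deque()  # write-back buffer of overflowed indices

    def increment(self, i):
        self.sram[i] += 1
        if self.sram[i] == 1 << self.m:   # the m-bit counter overflowed
            self.sram[i] = 0
            self.buffer.append(i)         # defer the slow DRAM update

    def flush_one(self):
        # Performed whenever a DRAM access slot becomes available.
        if self.buffer:
            self.dram[self.buffer.popleft()] += 1 << self.m

    def read(self, i):
        # A counter's value is its DRAM part, plus any overflows still
        # pending in the write-back buffer, plus its SRAM residue.
        pending = self.buffer.count(i) << self.m
        return self.dram[i] + pending + self.sram[i]
```

The design questions the individual proposals answer differently are precisely the ones this toy glosses over: how large m and the buffer must be so that, even under worst-case bursts, the buffer never overflows before DRAM can drain it.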