Measuring Large-Scale Distributed Systems: Case of BitTorrent Mainline DHT

Liang Wang Jussi Kangasharju

Department of Computer Science, University of Helsinki, Finland

Why do people measure?

System measurement is crucial in computer science from the following perspectives:

• Study system performance in the real world.
• Fine-tune system parameters.
• Detect anomalies.
• Locate bottlenecks.
• Study user behavior.

Why do people continuously measure?

Tons of measurement papers on P2P systems already exist, so why do people keep measuring? Does it make sense?

• Different methods may lead to a different “picture”.
• The system evolves, and more implementations appear.
• User behavior changes.
• The environment changes (e.g. government policy).
• Measurement methodology improves!

What are we doing here?

After BitTorrent has been actively studied for over a decade, what are we doing here?

Yet another BitTorrent measurement paper?

• Identify a systematic error in previous measurement work which leads to an over 40% estimation error!
• Propose a new methodology which generates more accurate results with less overhead!

BitTorrent Mainline DHT basics

Our measuring target is BitTorrent Mainline DHT (MLDHT).

MLDHT implements the minimum functionality of Kademlia and has only 4 control messages.

[Figure: a ring of peers with three highlighted nodes A (Peer 29), B (Peer 57) and C (Peer 82), illustrating the BT protocol plus the announce_peer (29, 59) and get_peers (82, 59) control messages.]

A node has a 160-bit ID. Nodes in an n-bit zone share the same n-bit prefix in their IDs.

Example: two nodes from a 6-bit zone →
ID1: 0010111110110101……010
ID2: 0010110011101011……000
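As a minimal illustration (the helper function and example IDs below are hypothetical, not from the slides), two node IDs belong to the same n-bit zone exactly when their n-bit prefixes agree:

```python
def same_zone(id1: int, id2: int, n: int, id_bits: int = 160) -> bool:
    """Return True if two MLDHT node IDs share the same n-bit prefix,
    i.e. they fall into the same n-bit zone of the 160-bit ID space."""
    return (id1 >> (id_bits - n)) == (id2 >> (id_bits - n))

# Example: the two IDs above agree on their first 6 bits, but not on the 7th.
id1 = int("0010111110110101" + "0" * 144, 2)
id2 = int("0010110011101011" + "0" * 144, 2)
assert same_zone(id1, id2, 6) and not same_zone(id1, id2, 7)
```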

A seemingly trivial common technique

A common “technique”: take a sample from the system, use it as the zone density, and scale up! So what’s wrong here?

[Figure: sampling an n-bit zone – three samples a, b and c taken from the same zone each contain a different subset of the nodes.]

[!] The missing node issue refers to the problem that a crawler systematically misses some nodes when crawling a target zone.

Should we recheck previous work?

A lot of previous work and demographic research based their claims on this number – the network size.
• Qualitatively correct, no big problem.
• Quantitatively, suspicious!

What if this fundamental number – the network size – is wrong?

Furthermore, how do we answer the following questions:
• What is the estimation error?
• How can we reduce the estimation error?

Why was the missing node issue overlooked?

All measurements have errors; otherwise we could forget about statistics completely. In MLDHT, the error comes from the missing node issue. Why was it overlooked by previous work?

• Overly optimistic and ideal assumptions about the MLDHT protocol in an imperfect world.

• Node density in different n-bit zones: a deceptive figure, with very stable behavior between the 5- and 23-bit zones.

• Can we safely scale up the zone density?

[Figure: number of nodes per n-bit zone (log scale, 1 to 10^7) as a function of n, for n = 0…30.]

The estimation error due to the missing node issue is magnified exponentially by simple scaling up! (So no linear regression!)
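A minimal formalization of this point (notation mine, not from the slides): suppose the network has N nodes spread uniformly over the ID space, so an n-bit zone holds d = N / 2^n of them, and a crawl captures each node only with probability p. The crawl then reports s ≈ p·d nodes, and naive scaling gives

\[
\hat{N}_{\text{naive}} = 2^{n} s \approx 2^{n} p\, d = p\,N,
\qquad
N - \hat{N}_{\text{naive}} \approx (1-p)\, d\, 2^{n}.
\]

The relative error (1 − p) is locked in by the sampling, while the absolute error grows with 2^n, so a modest per-zone deficit turns into millions of missing nodes at network scale, and fitting a line through zone densities cannot recover the missing fraction.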

Straightforward but strong evidence

An experiment where 20 simultaneous samples were taken from a 12-bit zone in MLDHT. Each sample contains 3000–3500 nodes.

• The number of distinct nodes increases as we merge more and more samples.
• The increase rate decays fast.
• Each sample only covers part of the zone.
• The estimation error originates from the missing node issue!

Over 40% estimation error!

[Figure: number of distinct nodes (about 3000 to 6000) vs. the number of merged samples (1 to 20); the gap between a single sample (previous work) and the merged samples (our contribution) corresponds to an over 40% estimation error, with accuracy traded against sampling overheads.]

Can simultaneous samples save us?

Simultaneous samples can definitely improve the accuracy, and drive our estimate closer to the real value. However,

• Taking simultaneous samples is expensive!

• We still don’t know the estimation error.

• If we don’t know how far we are from the real value, how many simultaneous samples should we take?

How to fix the problem?

Model the measurement as a simple stochastic process – a Bernoulli process. Then we can apply standard statistical analysis.

• Conceptually, a crawler goes through all the nodes in a target zone one by one.

• Each node has two options: being selected or being ignored.

• What is the probability (p) of being selected?

• The correction factor is defined as 1/p.
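In symbols (a minimal sketch of the model, notation mine): if the target zone really contains Z nodes and each of them independently ends up in the sample with probability p, then the sample size S is binomial, and dividing by p gives an unbiased zone estimate that can be scaled up to the whole ID space:

\[
S \sim \mathrm{Binomial}(Z, p), \quad \mathbb{E}[S] = pZ
\;\Longrightarrow\;
\hat{Z} = \frac{S}{p} = S \cdot \frac{1}{p},
\qquad
\hat{N} = 2^{n}\, \hat{Z}.
\]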

An analogy for system measurement

Assume we have a box of black marbles, and we are only allowed to take a “small” sample from it. We don’t know what fraction of the whole population the sample covers. How are we going to estimate the number of marbles in the box?

• Mix a certain number of white marbles into the box.
• Make sure they are well mixed, then take a sample.
• Count how many white marbles are in the sample.
• Calculate the correction factor (1/p).
• Then inferring the number of black marbles is trivial.
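A worked example with made-up numbers (this is essentially the classic mark-and-recapture estimator): mix in W = 100 white marbles, and suppose the sample contains w = 20 white and b = 3,200 black marbles. Then

\[
\hat{p} = \frac{w}{W} = \frac{20}{100} = 0.2,
\qquad
\text{correction factor} = \frac{1}{\hat{p}} = 5,
\qquad
\hat{B} = \frac{b}{\hat{p}} = 3200 \times 5 = 16{,}000.
\]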

How do we measure in practice?

Aim: fast, accurate, low overhead, and minimal impact on the system.

• Choose a random zone.
• Insert “controlled” nodes into the zone (less than 1% of the zone).
• Take a sample from the zone, count the “controlled” nodes, and estimate p.
• Repeat the experiment until reaching a satisfying variance.

(A simulation sketch of the whole procedure follows below.)
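The following is a self-contained simulation sketch of this procedure; it is not the actual crawler, and the zone size, capture probability and node counts are invented for illustration:

```python
import random

def estimate_network_size(zone_nodes, injected, capture_prob, zone_bits, rounds=50):
    """Simulate the correction-factor methodology: inject controlled nodes,
    sample the zone, estimate p from the controlled nodes that show up,
    apply the correction factor 1/p, and scale up to the full ID space."""
    estimates = []
    for _ in range(rounds):
        population = zone_nodes | injected
        # A crawl systematically misses nodes: each node is observed
        # independently with probability capture_prob (the unknown p).
        sample = {node for node in population if random.random() < capture_prob}
        captured_ctrl = len(sample & injected)
        if captured_ctrl == 0:              # degenerate crawl; practically never happens here
            continue
        p_hat = captured_ctrl / len(injected)        # estimated p
        ordinary = len(sample - injected)            # sampled "real" nodes
        zone_size_hat = ordinary / p_hat             # corrected zone density
        estimates.append(zone_size_hat * 2 ** zone_bits)
    return sum(estimates) / len(estimates)

# Hypothetical setup: a 12-bit zone with 5,000 real nodes, 50 injected nodes (~1%),
# and a crawler that captures roughly 82% of the nodes per crawl.
real_nodes = {("real", i) for i in range(5000)}
controlled = {("ctrl", i) for i in range(50)}
print(estimate_network_size(real_nodes, controlled, capture_prob=0.82, zone_bits=12))
```

The printed estimate should land close to 5,000 × 2^12 ≈ 20.5 million, whereas scaling the raw sample without the correction factor would report only about 82% of that.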

An experiment where 500 estimates of p were calculated. The samples were taken from a 12-bit zone, and the estimates passed the Jarque-Bera normality test.

[Figure: histogram of the 500 estimates of p, roughly between 0.7 and 0.95.]
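For reference, such a normality check could be run with SciPy's Jarque-Bera test (the 500 estimates below are synthetic stand-ins, not the measured data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_estimates = rng.normal(loc=0.82, scale=0.03, size=500)  # stand-in for 500 measured estimates of p

stat, p_value = stats.jarque_bera(p_estimates)
# A large p-value means normality of the estimates cannot be rejected.
print(f"JB statistic = {stat:.3f}, p-value = {p_value:.3f}")
```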

Correction factor – summary

• The correction factor is not a universal constant, but a simple methodology.

• Leads to more accurate results with an explicit estimation error.

• Lowers the requirements on crawler design (performance, etc.).

• Makes the results from different crawlers consistent.

• In a given environment, the correction factor is very stable.

• On the actual measurement platform, a moving average is a good option and only needs periodic calibration (the update formula is given after this list).

• No need to choose a specific zone, but we recommend a (5–20)-bit zone.
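From the paper: in the running system the moving average is refreshed at each crawl by weighting the new measurement more heavily, p = 0.2 · p_old + 0.8 · p_new, and the observed p stays rather stable at around 0.82.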

Validation of Methodology

Step-by-step validation

We validated each step of our methodology from the assumptions to the final outcome.

• Uniform node density: we carefully examined a large set of samples (over 32,000) from different parts of the ID space.

• Missing node: we identified and verified this issue with the “merging samples” and “injecting controlled nodes” experiments.

• Validity of the correction factor: we designed a large-scale emulation in a controlled environment.

Validation in controlled environment

• A high-performance computing cluster of 240 nodes was used to deploy a million-peer emulation.

• Different mixes of three implementations were used to mimic a realistic ecosystem.

• Small variation: the estimates are consistent after applying the correction factor.

[Figure: p value for the 10-bit and 12-bit zones as a function of the mix percent (MLBT:ARIA:LIBT) of 1:0:0, 8:1:1, 6:2:2 and 4:3:3.]

TABLE II: 10-bit and 12-bit zone’s p value with 95% confidence interval as a function of different mix percents of three applications. (MLBT: Mainline BitTorrent; ARIA: Aria; LIBT: libtorrent)

App Percent (%)      |           10-bit                  |           12-bit
MLBT   ARIA   LIBT   | p       stdev   Corr. Factor      | p       stdev   Corr. Factor
100    0      0      | 0.8215  0.0173  (1.1681, 1.2708)  | 0.8437  0.0263  (1.1157, 1.2641)
80     10     10     | 0.8045  0.0164  (1.1943, 1.2958)  | 0.8213  0.0291  (1.1370, 1.3104)
60     20     20     | 0.8198  0.0211  (1.1601, 1.2860)  | 0.8501  0.0243  (1.1127, 1.2477)
40     30     30     | 0.8109  0.0190  (1.1780, 1.2938)  | 0.8372  0.0257  (1.1254, 1.2726)
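Note (my reading of the numbers, not stated on the slide): the reported correction-factor intervals appear to be 1/(p ∓ 2·stdev); e.g. for the first row, 1/(0.8215 + 0.0346) ≈ 1.1681 and 1/(0.8215 − 0.0346) ≈ 1.2708.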

Validation in open environment

[Figure: “Estimate with and without CF” – estimated network size as a function of the number of simultaneous samples (1, 2, 4, 8, 16, 20), with and without the correction factor. The slide also embeds page 7 of the paper (13th IEEE P2P), which describes the monitoring system (Fig. 6), the moving-average update of p, and the deployment of the Fixpoint and Randompoint crawlers; these points are summarized on the following slides.]

• In the open environment, we don’t know the true network size, but we can validate the effectiveness of CF by comparing how the two methods converge.
• A small number of simultaneous samples (without CF) leads to a big error.
• CF effectively reduces the error even with a small number of samples.

TABLE III: Estimation with and without the correction factor. The numbers report the estimated size of the network when running n simultaneous samples. (CF: Correction Factor)

Simultaneous samples |          1 |          2 |          4 |          8 |         16 |         20
Without CF           | 13,021,184 | 16,642,048 | 19,390,464 | 21,254,144 | 22,245,376 | 22,540,288
With CF              | 21,951,659 | 22,077,793 | 22,187,636 | 22,328,867 | 22,915,635 | 23,195,538
Error                |     40.68% |     24.62% |     12.61% |      4.81% |      2.92% |      2.82%
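Note (inferred from the numbers, not stated on the slide): the Error row is the relative gap between the two estimates, Error = (With CF − Without CF) / With CF; e.g. for one sample, (21,951,659 − 13,021,184) / 21,951,659 ≈ 40.68%.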
Crawler impacts the result

We chose libtorrent as an example because it has been used in several projects as a crawler.

• Without tweaking, libtorrent only reports 1/6 of the actual network size.

• After tweaking, results are improved. But still, only 1/4 to 1/3 of the actual nodes are reported.

• Lots of things can impact the result: queue size, bucket size, mechanisms like banning suspicious nodes and dropping malformed messages, routing table operations, etc.

Actual measurement platform

Five principal components:

• Visualizer: visualizes the data.
• Maintainer: maintains several hundred active nodes for bootstrapping a crawl.
• Injector: injects “colored” nodes into the target zone, and replaces them periodically.
• Crawler: takes a sample from a target zone.
• Analyzer: counts the “controlled” nodes and calculates p.

[Figure: the MLDHT Monitor System with its Visualizer, Crawler, Analyzer, Maintainer and Injector components.]

Actual measurement deployment

• Measurement platform started on December 17th, 2010.

• Two physical nodes with one crawler on each, to prevent sample gaps due to software/hardware failures.

• The two crawlers use different sampling policies, to cross-correlate with each other.

• The sampling frequency was every 30 minutes, and was increased to every 10 minutes in 2013.

• Over 32,000 samples were collected.

Measurement – system evolution

• P2P is not dying: a 10% increase from 2011 to 2012, then the size remains stable.
• Some countries had an increase, some had a drop.
• A strong diurnal and seasonal pattern still exists.

[Figure: monthly average number of nodes (×10^7) for weekdays vs. weekends, January–October 2011.]

[Figure: estimated number of nodes over one week (Mon–Sun) for 2011.03.07–2011.03.13, 2012.03.05–2012.03.11 and 2013.04.08–2013.04.14.]

Measurement – anomaly detection

[Figure: normalized correction factor (CF), RTT and node density over 24 hours, during normal time (left) vs. under a Sybil attack (right).]

A real-world Sybil attack in MLDHT on Jan. 6, 2011. The attack came from two virtual machines on Amazon EC2 and started at 6:00 am.

For details, refer to: Liang Wang and Jussi Kangasharju, “Real-world Sybil Attacks in BitTorrent Mainline DHT,” IEEE GLOBECOM, 2012.

Conclusion

• The correction factor is not only a measurement tool, but also a technique to equip current and future measurements.

• Makes the results from different measurement tools consistent, with an explicit measurement error.

• Lowers the requirements on crawlers, making crawling more efficient with less overhead.

• Other interesting applications, such as anomaly detection.

Thank you!

Questions?

Measurement – user behavior

The 10% increase doesn’t necessarily imply that user behavior has changed.

• We calculate the autocorrelation function over the samples from 2011 (green) and 2012 (red).

• After the noise is filtered out, the system exhibits almost exactly the same behavior at different times.

[Figure: autocorrelation vs. lag (0–300) for the 2011 and 2012 series, before (top) and after (bottom) noise filtering.]
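A minimal sketch of how such an autocorrelation function can be computed over a node-count time series (the series below is synthetic, with a 24-hour cycle, and is not the measured data):

```python
import numpy as np

def autocorrelation(x, max_lag=300):
    """Sample autocorrelation of a 1-D series for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0, 1, 2, ...
    return acf[:max_lag + 1] / acf[0]                   # normalize so lag 0 equals 1

# Synthetic stand-in for hourly network-size samples with a diurnal pattern.
hours = np.arange(24 * 60)
series = 2.2e7 + 1.5e6 * np.sin(2 * np.pi * hours / 24) + np.random.normal(0, 2e5, hours.size)
acf = autocorrelation(series)
print(acf[24], acf[12])   # high correlation at lag 24 h, low (negative) at lag 12 h
```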

Measurement – real-world event

Causes for the missing node issue

The possible causes of the missing node issue can be manifold:

• The MLDHT protocol’s inherent problems.
• Network dynamics: nodes join and leave. This problem becomes more severe if the crawling time is long.
• Lost messages due to network congestion.
• Some protection mechanisms: e.g. blacklists, banning suspicious nodes, dropping malformed messages, small bucket sizes, small queue sizes.
• Stale or false information in the routing table, mainly due to implementation issues.
• The crawler is not efficient enough in terms of greediness and speed.
• Firewall issues.
