Secure and Robust Overlay Content Distribution

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Hun Jeong Kang

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

October, 2010

© Hun Jeong Kang 2010
ALL RIGHTS RESERVED

Acknowledgements

I am fortunate to have been advised by an excellent thesis committee. I am grateful to my advisors, Yongdae Kim and Nicholas Hopper, for their valuable advice and support. I would also like to thank my other committee members, Jon Weissman and Andrew Odlyzko, for helpful comments and questions on earlier drafts. I would like to specifically acknowledge Yongdae for his continuing support, patience, and encouragement throughout my PhD program. This thesis has benefited from collaboration with my colleagues; the use of “we” throughout this thesis is meant to acknowledge their contributions. In improving DHT lookups, Eric Chan-Tin helped me conduct experiments in Kad and discussed solutions to improve the performance of Kad lookups. I also thank Peng Wang and James Tyra for discussions in the early phase of the work. In making network coding practical and secure, I am indebted to Aaram Yun and Jung-Hee Chun for their contributions to the cryptographic aspects of the system described in this thesis. They played a key role in designing a homomorphic digital signature to protect network coding from pollution attacks and in providing theoretical proofs. Aaram especially provided me with insightful mathematical comments on network coding schemes. I also thank Eugene Vasserman for his valuable comments and help with implementing the secure network coding system.

Dedication

강승주 홍정선 정병조 황규자 고진환 홍정렬 황윤옥 정진령 정은성 김성미

To my family For their support, patience, and sacrifice

Abstract

With the success of applications spurring a tremendous increase in the volume of data transfer, efficient and reliable content distribution has become a key issue. Peer-to-peer (P2P) technology has gained popularity as a promising approach to large-scale content distribution due to its benefits, including self-organization, load balancing, and fault tolerance. Despite these strengths, P2P systems also present several challenges such as performance guarantees, reliability, efficiency, and security. In P2P systems deployed on a large scale, these challenges are more difficult to deal with because of the large number of participants, unreliable user behaviors, and unexpected situations. This thesis explores solutions to improve the efficiency, robustness, and security of large-scale P2P content distribution systems, focusing on three particular issues: lookup, practical network coding, and secure network coding. A distributed hash table (DHT) is a structured overlay network service that provides a decentralized lookup for mapping objects to locations. This thesis focuses on improving the lookup performance of the Kademlia DHT protocol. Although many studies have proposed DHTs as a means of organizing and locating peers for distributed systems, to the best of my knowledge, Kademlia is the only DHT deployed on an Internet scale in the real world. This study evaluates the lookup performance of Kad (a variant of Kademlia) deployed in one of the largest P2P file-sharing networks. The measurement study shows that lookup results are not consistent; only 18% of the nodes located by storing and searching lookups are the same. This lookup inconsistency problem leads to poor performance and the inefficient use of resources during lookups. This study identifies the underlying reasons for the inconsistency problem and the poor performance of lookups, and proposes solutions that guarantee reliable lookup results while making efficient use of resources.
This thesis also studies the practicality of network coding to facilitate cooperative content distribution. Network coding is a data transmission technique that allows any node in a network to encode and distribute data. It offers reliability and efficiency in distributing content, but its usefulness is still in dispute because of its dubious performance gains and coding overhead in practice. With

the implementation of network coding in a real-world application, this thesis measures the performance and overhead of network coding for content distribution in practice. This study also provides a lightweight yet efficient encoding scheme which allows network coding to provide improved performance and robustness with negligible overhead. Network coding is a promising data transmission technique. However, the use of network coding also poses security vulnerabilities by allowing untrusted nodes to produce new encoded data. Network coding is seriously vulnerable to pollution attacks, where malicious nodes inject corrupted data into a network. Because of the nature of network coding, even a single unfiltered false data block may propagate widely in the network and, by being mixed with other correct blocks, disrupt correct decoding at many nodes. Since blocks are re-coded in transit, traditional hash or signature schemes do not work with network coding. Thus, this thesis introduces a new homomorphic signature scheme which efficiently verifies encoded data on-the-fly and comes with desirable features appropriate for P2P content distribution. This scheme can protect network coding from pollution attacks without delaying downloading processes.

Contents

Acknowledgements

Dedication

Abstract

List of Tables

List of Figures

1 Introduction
1.1 Reliable and Efficient Lookup
1.2 Practical Network Coding
1.3 Secure Network Coding

2 Reliable and Efficient Lookup
2.1 Background
2.1.1 Kad
2.1.2 Kad Lookup
2.2 Evaluation of Kad Lookup Performance
2.2.1 Experimental Setup
2.2.2 Performance Results
2.3 Analysis of Poor Lookup Performance
2.3.1 Characterizing Routing Table Entries
2.3.2 Analysis of Lookup Inconsistency

2.4 Improvements
2.4.1 Solutions
2.4.2 Performance Comparisons
2.5 Object Popularity and Load Balancing
2.6 Related Work
2.7 Summary

3 Practical Network Coding
3.1 Preliminaries
3.1.1 Cooperative Content Distribution
3.1.2 Random Linear Network Coding
3.1.3 Performance and Overhead in Network Coding
3.2 Practical Network Coding System
3.2.1 System Architecture
3.2.2 i-Code: Lightweight and Efficient Coding
3.3 Evaluation
3.3.1 Comparisons of encoding schemes
3.3.2 Practicality Check
3.4 Related work
3.5 Summary

4 Secure Network Coding
4.1 Preliminaries
4.1.1 Threat Model
4.1.2 Related work
4.1.3 Requirements for content distribution
4.2 Secure Network Coding
4.2.1 Signature Scheme
4.2.2 Comparisons
4.3 Practical Consideration
4.3.1 Parameter Setup
4.3.2 Performance Boost
4.4 Evaluation

4.5 Summary

5 Conclusion

References

Appendix A. Security Analysis
A.1 KYCK Signature
A.2 Batch Verification

List of Tables

3.1 Parameters for generations and blocks
4.1 Comparison of secure network coding schemes

List of Figures

1.1 Block (block) scheduling problem
2.1 Illustration of a GET lookup
2.2 Performance of lookup: (a) search yield (immediately after PUT) (b) search yield (over 24 hour window) (c) search success ratio (over time) (d) search access ratio (by distance)
2.3 Statistics on routing tables
2.4 Illustration of how a lookup can be inconsistent
2.5 Number of nodes at each distance from a target
2.6 Number of replica roots at each distance from a target
2.7 Lookup algorithm for Fix2
2.8 Lookup Improvement (Search Yield)
2.9 Lookup Overhead
2.10 Lookup performance over time
2.11 (a) Lookup with real popular objects (b) Original Kad lookup for our objects (c) New Kad lookup for our objects (d) Load for each prefix bit for real popular objects and our objects
3.1 Comparisons of downloads between non-coding and network coding
3.2 Comparisons with the optimal downloading time
3.3 Tradeoff between CPU overhead and block dependency
3.4 System architecture
3.5 i-Code design. Our encoding scheme requires only one block to be read from disk and one linear combination, greatly reducing encoding overhead.
3.6 Comparison of overhead
3.7 Comparison of downloading time

3.8 Detailed comparison of time overhead
3.9 Comparison of block dependency
3.10 Detailed comparison of time overhead
3.11 Comparisons of downloading times with 256KB block size
3.12 Comparisons of downloading times in different environments with 256KB block size
3.13 Impact of block size on downloading times
3.14 Downloading times with flash-crowd
3.15 Downloading times according to peer arrivals and departures
3.16 Completeness according to source departure times
3.17 Downloading times of fast nodes in heterogeneous environments
3.18 Impact of number of neighbors
4.1 Signature verification time
4.2 Signature verification time
4.3 Signature verification time
4.4 Signature verification time

Chapter 1

Introduction

With the growth of computer networks, the recent popularity of applications in entertainment and business has spurred a tremendous increase in the volume of data transfer. In these applications, efficient and reliable content distribution has become a key issue. Traditionally, content distribution systems have been based on the client-server model; clients download the entire content from dedicated servers. However, this client-server model comes at a high cost for running the servers: they are costly to maintain, bandwidth is expensive, and steps must be taken to prevent a server or data center from becoming a single point of failure. A failure of the central server halts the whole service even if all the clients are alive. To avoid some of these problems, peer-to-peer (P2P) technologies have provided a new paradigm for content distribution. All content receivers (or nodes) also become content providers, cooperatively participating in the content distribution. Since their roles are equal, in contrast to the client-server relationship, they are called peers. They autonomously form an overlay network and contribute their bandwidth and computing resources to content distribution.1 The available resources of P2P systems grow as the number of peers in the network increases. Although individual peers have lower uptimes than dedicated servers and contribute less bandwidth, their numbers, theoretically, more than make up for their individual resource constraints. Furthermore, P2P technology helps content distribution scale better. The more popular the content is, the more peers participate in its distribution and thus the more peers

1 P2P, cooperative, and overlay content distributions are used interchangeably in this work.

contribute resources. In short, P2P architecture is a promising candidate to make content distribution more scalable, more fault-tolerant, and faster. Despite these strengths, P2P content distribution systems present several challenges such as performance guarantees, reliability, efficiency, and security. For P2P systems that are deployed on a large scale, these challenges are more difficult to deal with because of the large number of participants, unreliable user behaviors, and unexpected situations. In this thesis, we explore solutions to improve the reliability and efficiency of large-scale P2P content distribution systems, focusing on three particular issues: lookup, practical network coding, and secure network coding. The rest of this chapter presents an overview of our research, followed by details in the next chapters. Chapter 2 diagnoses the reliability and efficiency of lookups, Chapter 3 explores the practicality of network coding, and Chapter 4 studies the security of network coding in practice.

1.1 Reliable and Efficient Lookup

In P2P content distribution systems, content is distributed to nodes in a network instead of a central server. Therefore, a peer should be able to locate where desired content exists, and this lookup is an essential issue in a P2P system. In early P2P networks such as Napster, a central server stores the location information of all content in a distribution network. With this approach, peers simply query the server and can easily search for desired content. However, this approach is not scalable and suffers from a single point of failure. To solve these problems, decentralized and unstructured lookup is used in Gnutella [1] and Kazaa [2]. With this distributed approach, there is no centralized server and each node keeps the location information of content stored locally. When a node attempts to locate content, it floods requests to the network, which makes this unstructured method of searching for content expensive. For efficient and scalable lookup, the research community proposed distributed hash tables (DHTs), also called structured overlays [3, 4, 5, 6, 7]. In DHTs, each peer has an overlay address and a routing table. When a peer performs a query for an identifier, the query is routed to the peer with the closest overlay address. The DHT enforces rules on the way peers select neighbors to guarantee performance bounds on the number of hops needed to perform a query (typically O(log N), where N is the number of peers in the network). A DHT provides a simple put/get interface, similar to traditional hash tables. If a node inserts a key-value pair (k, v), any peer can retrieve the value v with key k. Because a DHT provides a decentralized lookup service mapping objects to peers, it can also provide a means of organizing and locating peers for use in higher-level applications in a large peer-to-peer (P2P) network.
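To make the put/get interface and XOR-based routing concrete, here is a minimal, hypothetical sketch in Python. The `ToyDHT` class, its method names, and the use of SHA-1 identifiers are illustrative assumptions, not code from Kad or any real client; a real DHT routes queries through neighbors in O(log N) hops rather than sorting a global peer list.

```python
import hashlib

def node_id(name: str) -> int:
    """Derive a 160-bit identifier, as Kademlia-style DHTs do."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    """In Kademlia, the distance between two identifiers is their bitwise XOR."""
    return a ^ b

class ToyDHT:
    """Toy global view of a DHT: each peer id maps to its local key-value store."""
    def __init__(self, peer_ids):
        self.peers = {pid: {} for pid in peer_ids}

    def closest(self, key: int, k: int):
        """The k peers whose ids are XOR-closest to the key (replica roots)."""
        return sorted(self.peers, key=lambda p: xor_distance(p, key))[:k]

    def put(self, key: int, value, k: int = 3):
        for pid in self.closest(key, k):  # replicate at the k closest peers
            self.peers[pid][key] = value

    def get(self, key: int, k: int = 3):
        for pid in self.closest(key, k):  # query the k closest peers
            if key in self.peers[pid]:
                return self.peers[pid][key]
        return None
```

With a consistent view of the network, a `get` following a `put` under the same key reaches the same replica roots; the lookup inconsistency studied in Chapter 2 arises precisely because deployed clients do not reach the same set of nodes.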
This potential to be used as a fundamental building block for large-scale distributed systems has led to an enormous body of work on designing highly scalable DHTs. Despite this, only a handful of DHTs have been deployed on an Internet scale: Kad, Azureus [8], and Mainline [9], all of which are based on the Kademlia protocol [7]. This thesis evaluates and improves Kademlia lookup performance with respect to reliability and efficiency in the use of resources. The study focuses on Kad because it is widely deployed, with more than 1.5 million simultaneous users [10]. Furthermore, DHT lookups play a more significant role in Kad than in Azureus and Mainline, which use DHT lookups only for bootstrapping. Like other DHTs, Kad uses a data replication scheme; object information is stored at multiple nodes (called replica roots). Therefore, a peer can retrieve the information once it finds at least one replica root. However, we observe that 8% of lookup operations for searching cannot find any replica roots immediately after publishing, which means they are unable to retrieve the information. Even worse, 25% of searchers fail to locate the information 10 hours after it is stored. This poor performance is due to inconsistency between the storing and searching lookup processes; Kad lookups for the same object map to inconsistent sets of nodes. From our measurements, only 18% of the replica roots located by the storing and searching lookup services are the same on average. Moreover, this lookup inconsistency causes an inefficient use of resources. We also find that, for rare objects, 45% of replica roots are never located, and thus never used, by any searching peers. Furthermore, when many peers search for popular information stored by many peers, 85% of replica roots are never used and only a small number of the roots bear the burden of most requests. Therefore, Kad lookup is not reliable and wastes resources such as bandwidth and storage on unused replica roots.
Why are the nodes located by publishing and searching lookups inconsistent? Past studies [11, 12] on Kademlia-based networks have claimed that lookup results differ because routing tables are inconsistent due to dynamic node participation (churn) and slow routing table convergence. We question this claim and examine the entries in the routing tables of nodes around a certain key space. Surprisingly, the routing table entries are much more similar among the nodes than expected. Therefore, these nodes return a similar list of their neighbors to be contacted when they receive requests for the key. However, the Kad lookup algorithm does not consider this high level of similarity in routing table entries. As a result, this duplicate contact list limits the number of unique replica roots located around the key. Consistent lookups enable reliable information search even when some copies of the information are unavailable due to node churn or failure. They can also provide the same level of reliability with a smaller number of required replica roots than inconsistent lookups, which means lookups make efficient use of resources such as bandwidth and storage. Furthermore, consistent lookups locating multiple replica roots provide a means of load balancing. Therefore, we propose algorithms that take the routing table similarity in Kad into account and show how improved lookup consistency affects performance. These solutions can improve lookup consistency up to 90% and eventually guarantee reliable lookup results while providing efficient resource use and load balancing. Our solutions are completely compatible with existing Kad clients, and thus incrementally deployable.
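One simple way to quantify the lookup inconsistency discussed above is the overlap between the replica roots found by a storing (PUT) lookup and by a subsequent searching (GET) lookup. The sketch below is a hypothetical metric in that spirit; the 18% figure in the text comes from the thesis's own measurements, not from this code.

```python
def lookup_consistency(put_roots, get_roots):
    """Fraction of located replica roots common to both lookups
    (intersection over union of the two located node sets)."""
    put_roots, get_roots = set(put_roots), set(get_roots)
    if not (put_roots | get_roots):
        return 0.0
    return len(put_roots & get_roots) / len(put_roots | get_roots)
```

For example, if a PUT reached nodes {1, 2, 3} and the GET reached {3, 4, 5}, only one of five distinct located nodes is shared, giving a consistency of 0.2.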

1.2 Practical Network Coding

Despite their advantages and popularity, existing cooperative content distribution systems suffer decreased performance through poor design, overly high peer turnover, and unforeseen emergent properties of large peer groups. In a typical P2P content distribution system like BitTorrent [13], content is divided into pieces (or blocks).2 Peers exchange missing blocks with each other until they collect all the pieces of the content and reconstruct the original content. As soon as a node acquires at least one block, the node can offer the received block to others. Due to the direction of content flow, we refer to the uploader and downloader as the upstream and downstream nodes, respectively. This parallelizes downloads, such that peers can simultaneously download different blocks from different nodes, achieving higher throughput [13, 14]. However, this approach also poses significant challenges in the form of scheduling and availability problems. Peers must make decisions about how to upload and download content, which is called block scheduling. These decisions include which blocks they retrieve, from which peers they download blocks, and to which peers they provide blocks. It is difficult to find the optimal schedule that minimizes downloading times, especially when peers make local decisions without relying on central coordination. This is referred to as the block scheduling problem. To illustrate this problem, we take an example modified from [15]. In Figure 1.1(a), peer B is about to complete downloading block X from peer A, and peer C needs to decide which blocks to download from A and B. If C decides to download block X from A, then both B and C will have the same block X. This leads to the problem that the link between B and C cannot be used, and the downloading process of C will be delayed. This problem becomes more difficult in larger-scale P2P systems. To address the block scheduling problem, P2P systems use scheduling schemes such as random and local-rarest-first policies, but their block scheduling is still often criticized as inefficient [13, 16].

2 By following the specification of BitTorrent, a piece refers to a part of content and a block describes the data that is exchanged between peers in this thesis.

(a) Non-coding (b) Network Coding

Figure 1.1: Block (block) scheduling problem
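As a concrete illustration, the local-rarest-first policy mentioned above can be sketched as follows. This is a deliberate simplification (real BitTorrent clients randomize tie-breaking and track availability incrementally as peers announce blocks); the function and parameter names are hypothetical.

```python
from collections import Counter

def local_rarest_first(my_blocks, neighbor_blocks):
    """Pick the missing block held by the fewest neighbors.

    my_blocks: set of block indices this peer already has.
    neighbor_blocks: dict mapping neighbor name -> set of its block indices.
    Returns the rarest missing block index, or None if nothing is missing.
    """
    counts = Counter()
    for blocks in neighbor_blocks.values():
        counts.update(blocks)          # availability of each block locally visible
    candidates = [b for b in counts if b not in my_blocks]
    if not candidates:
        return None
    return min(candidates, key=lambda b: counts[b])  # rarest first
```

In the Figure 1.1(a) situation, a rarest-first peer C would avoid re-downloading the widely held block X and instead fetch the block fewest neighbors have, keeping the B-C link useful.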

The availability of data blocks may also affect the performance of content distribution systems. P2P networks are dynamic in nature because peers may arrive, depart, or fail frequently, which is referred to as peer dynamics or churn. When some peers are not available, certain blocks may become rare. Peers missing those blocks must wait in a long line for their turn to receive the rare blocks. This availability problem makes efficient scheduling harder. Even worse, some blocks may be unavailable entirely because they are held by peers who happen to be offline. Recall that content cannot be reconstructed when even one block is missing. Therefore, peers fail to download the whole content due to a small portion of missing data. Applying network coding to P2P systems has been considered a solution to these problems [17, 18, 19, 20, 21, 15, 22, 23, 24]. We will refer to network coding-enabled P2P systems for content distribution as ncP2P. In contrast, P2P systems which do not use network coding will be referred to as non-coding systems or simply P2P-only. The key idea is to allow peers to “encode” their blocks when sending outgoing data to downstream nodes. In this thesis, we focus on linear network coding [25], which is commonly used in many studies. When a peer uploads a block to another node, it sends a linear combination of some or all of the blocks it has. This way, peers no longer have to fetch a copy of each specific block; rather, a peer simply asks another node to send a coded block, without specifying a block index. In Figure 1.1(b), peer C will download a linear combination of blocks X and Y from A without worrying about which blocks B will have. Then B and C can exchange blocks with each other, which efficiently uses the link between B and C and minimizes the downloading time. After a downloader receives enough linearly independent blocks, it can reconstruct the original content, eliminating the requirement that each block be downloaded individually.
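The linear combination step described above can be sketched as follows, assuming arithmetic over GF(2^8), a field commonly used in practice. This is an illustrative toy rather than code from any deployed system; in a real system the coefficient vector travels with the coded block so downloaders can decode by Gaussian elimination once enough independent blocks arrive.

```python
import random

def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) using the AES reduction polynomial 0x11b."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def encode(blocks, coeffs):
    """Byte-wise linear combination sum_i c_i * b_i over GF(2^8).
    Addition in GF(2^8) is XOR, hence the ^= accumulation."""
    out = bytearray(len(blocks[0]))
    for c, block in zip(coeffs, blocks):
        for i, byte in enumerate(block):
            out[i] ^= gf_mul(c, byte)
    return bytes(out)

# Four toy 16-byte blocks stand in for 256KB content blocks.
blocks = [bytes(random.randrange(256) for _ in range(16)) for _ in range(4)]
coeffs = [random.randrange(1, 256) for _ in blocks]  # random nonzero coefficients
coded = encode(blocks, coeffs)  # a fresh block any peer can forward or re-mix
```

The cost of `encode` grows with the number of input blocks, which is exactly the encoding-overhead concern raised in the next section.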
Even when some peers holding specific blocks leave the network, other nodes will have no difficulty downloading coded blocks from the remaining peers and recovering the original content. Therefore, network coding can potentially provide better robustness and reliability for content distribution. Despite these benefits, network coding has not been widely used in real-world P2P systems for content distribution. There has been some doubt about the performance gains from network coding in practice. In addition, network coding has been blamed for its computational complexity and excessive resource use. In this thesis, we explore the challenges and problems we face when network coding is applied to real-world P2P applications for content distribution. We eventually answer the following questions:

• How much can real-world applications benefit from network coding?

• How much overhead does network coding introduce in real environments?

• How can we improve network coding in order to make it more practical?

To answer these questions, we chose BitTorrent [13], one of the most popular P2P protocols, as a concrete application. We modified a BitTorrent client [26] to use network coding and measured the performance and overhead of the system. With network coding, the first issue we faced was deciding how to encode data. In linear network coding, a node must decide how many blocks are combined to generate an outgoing block, since encoding time is not trivial. We measured an encoding time of 2 milliseconds to combine two 256KB blocks when using commodity hardware. When blocks are stored on a disk, encoding time may increase significantly due to disk access delay, which varies depending on numerous factors such as disk speed, disk cache size, available system memory, and the number of page faults. We observed disk access times varying from 30 microseconds to 0.2 seconds to load 256KB of data. Network coding was originally formulated such that all blocks available to a peer are combined to produce an encoded block [25]. In current content distribution, files are often quite large and consist of hundreds or thousands of blocks (or smaller numbers of blocks with larger block sizes). If we assume a block size of 256KB for such large files, it would take several seconds to encode a single outgoing block. Therefore, it is almost impossible to use this “full” encoding in practice. To reduce the encoding overhead, a series of schemes [27, 18, 22, 19] let peers use fewer input blocks to generate a coded block. However, this approach generates more dependent blocks, especially when too few input blocks are combined. In linear network coding, the usefulness of data is determined by linear dependency. These dependent blocks do not contribute “useful” data to other nodes, since they carry duplicate information from other blocks, thus wasting bandwidth and time.
The high level of block dependency delays content propagation, since peers have difficulty locating independent blocks. There is a direct tradeoff between encoding overhead and block dependency. Currently, no encoding scheme achieves both low encoding overhead and low block dependency. The primary contribution of our work is the design of i-Code, an encoding scheme satisfying both requirements of low encoding overhead and a low level of block dependency. i-Code combines only two blocks for every encoding operation, dramatically reducing the encoding overhead. However, it does not have the dependent-block penalty faced by encoding schemes which combine few input blocks. The key idea is to emulate an encoding scheme which combines many input blocks. To that end, each peer using i-Code maintains a “well-mixed” block which we call the accumulation block (a). Whenever a peer receives an independent block (w), it updates its accumulation block with the new data (a ← αa + βw, for randomly chosen coefficients α and β). When the peer encodes a new block, it selects a block from its local store and linearly combines it with the accumulation block. Therefore, all blocks the peer has are “accumulated” into a, and mixing any block with the accumulation block has an effect similar to combining many blocks. We compare the performance and overhead of each network coding scheme and show that i-Code exhibits a low level of block dependency comparable to full coding, with significantly less overhead, and fewer dependent blocks than sparse coding schemes. To address the practicality of network coding, we also measure performance and overhead in real environments. Prior studies on the benefits of network coding [15, 28] have been based on simulations or theoretical analyses, and may not reflect real network conditions. Although Gkantsidis et al. provide an implementation in [22], there is no real-world performance comparison between their network coding-enabled implementation and a non-coding system. Other recent work shows potential practical benefits of ncP2P over P2P alone [23, 19, 18], but the experiments were performed with a small number of nodes or small file sizes, in network settings that were overly favorable to network coding, thus limiting the generalizability of the findings. With our BitTorrent client using i-Code, we provide a thorough empirical comparison between a P2P-only and an ncP2P system, using many nodes communicating over a local-area or wide-area network.3 Experimental results show that the content distribution time of the ncP2P system decreases by 5–21% compared to the P2P-only system (BitTorrent alone), while providing much better reliability and robustness.
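The accumulation-block idea behind i-Code can be sketched as follows. This is a hypothetical simplification for illustration: it omits the coefficient-vector bookkeeping and independence checks a real implementation needs, and the class and helper names are invented.

```python
import random

def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) using the AES reduction polynomial 0x11b."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def scale_xor(dst: bytearray, c: int, src: bytes) -> None:
    """dst ^= c * src, byte-wise over GF(2^8) (addition is XOR)."""
    for i, byte in enumerate(src):
        dst[i] ^= gf_mul(c, byte)

class ICodePeer:
    """Toy i-Code peer keeping a 'well-mixed' accumulation block a."""
    def __init__(self, block_len: int):
        self.acc = bytearray(block_len)   # accumulation block a
        self.store = []                   # independent blocks received so far

    def receive(self, w: bytes) -> None:
        # a <- alpha*a + beta*w for random nonzero coefficients
        alpha, beta = random.randrange(1, 256), random.randrange(1, 256)
        old = bytes(self.acc)
        self.acc = bytearray(len(w))
        scale_xor(self.acc, alpha, old)
        scale_xor(self.acc, beta, w)
        self.store.append(w)

    def encode(self) -> bytes:
        # Combine ONE stored block with the accumulation block: only a
        # single disk read and a single linear combination per coded block.
        x = random.choice(self.store)
        gamma, delta = random.randrange(1, 256), random.randrange(1, 256)
        out = bytearray(len(x))
        scale_xor(out, gamma, x)
        scale_xor(out, delta, self.acc)
        return bytes(out)
```

Because every received independent block is folded into `acc`, each two-block encoding implicitly carries contributions from all blocks the peer has seen, which is the emulation of full coding described above.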

1.3 Secure Network Coding

P2P content distribution systems run in inherently untrustworthy environments. Some nodes may be malicious and attempt to disrupt the distribution of content. They may launch a number of attacks aimed at disrupting P2P architectures by providing false information, dropping messages, or steering peers to malicious nodes. In this thesis we do not consider such types of attacks. Instead, we focus on those attacks that arise from the use of network coding for content distribution. Network coding-enabled P2P systems can provide improved performance and robustness for cooperative content distribution. However, the use of network coding also poses security vulnerabilities by allowing any node to produce new encoded data. Some attackers intentionally send other nodes linearly dependent blocks, which do not contribute useful data to the receivers, for the purpose of decreasing the diversity of data in the distribution network [30]. However, this type of attack can be easily detected and prevented. On the other hand, network coding is seriously vulnerable to pollution attacks, and we will focus on protecting network coding from this type of attack. In pollution attacks, malicious nodes inject corrupted data into a distribution network by sending other nodes blocks which are not linear combinations of the original content. These corrupted blocks disrupt correct decoding; the original content cannot be reconstructed from the corrupted data. Before decoding, they cannot be filtered out by the traditional hashes and digital signatures commonly used in P2P applications to verify the integrity of content pieces. In BitTorrent, for example, a content source creates a hash for each piece using a hash function and records the hash value in a metadata file which will be distributed to downloaders. When a downloader receives a particular block, the hash of the block is compared to the recorded hash to test whether the block has been modified. With network coding, content blocks are encoded in transit by being mixed with other blocks. Because the source cannot provide hashes or signatures of encoded blocks in advance, the traditional methods of checking data integrity do not work with network coding.

3 We use PlanetLab [29] for wide-area network testing.
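The traditional per-piece verification described above, and the reason it fails for coded blocks, can be sketched as follows. This is an illustrative toy, not the actual BitTorrent metadata format.

```python
import hashlib

def make_metadata(pieces):
    """Source records one hash per piece, in the spirit of a .torrent file."""
    return [hashlib.sha1(p).digest() for p in pieces]

def verify(piece: bytes, index: int, metadata) -> bool:
    """Downloader compares the received piece's hash to the recorded one."""
    return hashlib.sha1(piece).digest() == metadata[index]

pieces = [b"piece-0" * 100, b"piece-1" * 100]
meta = make_metadata(pieces)
assert verify(pieces[0], 0, meta)   # an unmodified piece passes

# A network-coded block is a fresh linear combination created in transit
# (here, an XOR of two pieces), so no pre-computed hash can match it:
mixed = bytes(a ^ b for a, b in zip(pieces[0], pieces[1]))
assert not any(verify(mixed, i, meta) for i in range(len(meta)))
```

The valid coded block is indistinguishable from a polluted one under this check, which is why verification must instead be homomorphic with respect to the coding operation.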
In addition, the nature of network coding aggravates the problem. If a false block is not filtered, it is combined with other correct blocks for encoding and thus corrupts the outgoing blocks, which will in turn corrupt other blocks at other peers. Even a single unfiltered false block may propagate in the network while exponentially increasing the number of corrupted blocks. Although the final decoded file can be identified as corrupted with a traditional hash, the bandwidth, storage, and computation time wasted on the invalid file cannot be recovered. Since it is not obvious which block was corrupted, downloaders must re-try downloading the entire file, potentially encountering the same pollution. Therefore, P2P content distribution systems must implement a protocol to verify coded data before passing it on to other nodes. Although several schemes have been proposed for securing network coding against pollution attacks, they require high computational overhead or are not appropriate for P2P systems. Schemes in [31, 32] have relatively higher computational overhead than other schemes because of pairing operations and the cost of signature generation and aggregation. Schemes in [33, 30, 34] do not allow hashes, checksums, or public key information to be distributed before content is prepared; thus, they are not appropriate for P2P streaming, one type of content distribution. Schemes in [30, 35] require secure channels between peers or use symmetric keys, which cannot be easily realized in P2P environments. The scheme in [36] makes the size of each block variable and becomes inefficient when blocks traverse many nodes. This is not appropriate in P2P systems, where we do not know how many nodes blocks will traverse or how severe data expansion will eventually become.
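A toy example may clarify the homomorphic property that on-the-fly verification schemes rely on: an authenticator of a linear combination of blocks can be checked against the authenticators of the original blocks, without seeing the blocks themselves. The group, generators, and values below are illustrative only and do not constitute a secure scheme.

```python
# Small prime modulus for illustration; real schemes use large groups
# and cryptographically chosen generators.
P = 1000003
G = [2, 3, 5, 7]  # one (hypothetical) generator per symbol of a block

def auth(v):
    """Toy homomorphic authenticator h(v) = prod_i g_i^{v_i} mod P."""
    h = 1
    for g, x in zip(G, v):
        h = h * pow(g, x, P) % P
    return h

u = [10, 20, 30, 40]            # block u as a vector of symbols
v = [5, 6, 7, 8]                # block v
a, b = 3, 4                     # coding coefficients
combo = [a * x + b * y for x, y in zip(u, v)]  # coded block a*u + b*v

# The authenticator of the combination equals the combination of the
# authenticators: h(a*u + b*v) == h(u)^a * h(v)^b (mod P).
assert auth(combo) == pow(auth(u), a, P) * pow(auth(v), b, P) % P
```

This is why an intermediate peer can verify a freshly coded block against source-provided authenticators, even though the source never saw that particular block.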
Our goal is to answer the question: can secure network coding still provide improved performance with affordable overhead, compared to a system which does not use network coding? We first propose an efficient homomorphic signature scheme which verifies encoded data on-the-fly. It also provides the desirable features for P2P content distribution identified in the discussion of other schemes above. We implement our scheme in a real-world BitTorrent client and measure its performance and overhead in real content distribution. We conclude that our secure scheme can protect network coding from attacks with affordable overhead and no delay to downloading processes.

Chapter 2

Reliable and Efficient Lookup

The lookup to find where desired content exists is an essential operation in a P2P system. Distributed hash tables (DHTs) have been proposed as a distributed, efficient, and scalable overlay lookup service for large-scale P2P systems. DHTs can also provide a means of organizing and locating peers for use in higher-level applications in a large peer-to-peer (P2P) network. This potential to serve as a fundamental building block for large-scale distributed systems has led to an enormous body of work on designing highly scalable DHTs. Nevertheless, only a handful of Kademlia-based DHTs have been deployed on an Internet scale. This chapter studies the lookup performance of locating nodes responsible for replicated information, focusing on Kad, a popular Kademlia-based DHT. Section 2.1 gives a brief background on Kad and its lookup. Section 2.2 presents measurement results on Kad lookup performance. Section 2.3 identifies the factors underlying these observations, and Section 2.4 presents solutions to improve the lookup performance. Section 2.5 provides further insights into object popularity and load balancing in the Kad network. Section 2.6 surveys related work and Section 2.7 summarizes the chapter.

2.1 Background

2.1.1 Kad

Kad is a Kademlia-based DHT for P2P file sharing. It is widely deployed, with more than 1.5 million simultaneous users [10], and is connected to the popular eDonkey file-sharing network. The aMule and eMule clients are the two most popular clients used to connect to the Kad network. We examine the performance of Kad using aMule (version 2.1.3 at the time of writing), a popular cross-platform open-source project. The other client, eMule, has a similar design and implementation. Kad organizes participating peers into an overlay network and forms a key space of 128-bit identifiers among peers. (We use "peer" and "node" interchangeably in this work.) It "virtually" places a peer at a position in the key space by assigning a node identifier (Kad ID) to the peer. The distance between two positions in the key space is defined as the value of a bitwise XOR on their corresponding keys. In this sense, the more prefix bits are matched between two keys, the smaller the distance. Based on this definition, we say that a node is "close" (or "near") to another node or a key if the corresponding XOR distance is small in the key space. Each node takes responsibility for objects whose keys are near its Kad ID. As a building block for file sharing, Kad provides two fundamental operations: PUT, to store a binding in the form of (key, value), and GET, to retrieve value with key. These operations can be used to store and retrieve objects holding file information. For simplicity, we only consider keyword objects in this work, because almost the same operations are performed for other objects such as file objects. Consider a file to be shared, its keyword, and keyword objects (or bindings), where key is the hash of the keyword and value is the metadata for the file, stored at a node responsible for the key. Peers who own the file publish the object so that any user can search for the file with the keyword and retrieve the metadata.
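The XOR metric and its prefix-matching interpretation just described can be sketched as follows, a toy illustration with our own helper names, treating keys as 128-bit integers:

```python
def xor_distance(a, b):
    """XOR metric used by Kad: the distance between two 128-bit keys."""
    return a ^ b

def matched_prefix_len(a, b, bits=128):
    """Number of leading bits two keys share; a longer match means a smaller distance."""
    d = a ^ b
    return bits if d == 0 else bits - d.bit_length()

# Keys differing only in low-order prefix bits are "close" under XOR.
k1 = 0b1011 << 124  # 1011 followed by 124 zero bits
k2 = 0b1010 << 124  # shares the first 3 bits with k1
k3 = 0b0011 << 124  # differs from k1 in the very first bit

assert matched_prefix_len(k1, k2) == 3
assert matched_prefix_len(k1, k3) == 0
assert xor_distance(k1, k2) < xor_distance(k1, k3)  # longer prefix match, smaller distance
```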
From the information in the metadata, users interested in the file can download it. Because a peer responsible for the object might not be available, Kad uses data replication; the binding is stored at r nodes (referred to as replica roots; r is 10 in aMule). To prevent binding information from being stored at arbitrary locations, Kad has a "search tolerance" that limits the set of potential replica roots for a target.

2.1.2 Kad Lookup

In both PUT and GET operations, a Kad lookup for a target key (T) performs the process of locating the nodes which are responsible for T (nodes near T). In each lookup step, a query is sent to peers closer to target T. Because a lookup in Kad is based on prefix matching, a querying node selects the nodes (contacts) which have the longest matched prefix with T. The number of steps in a Kad lookup is bounded by O(log(N)), and a lookup is performed iteratively: each peer on the way to key T returns the next contacts to the querying node. The querying node approaches the node closest to T by repeating lookup steps until it cannot find any nodes closer to T than those it has already learned; this is Phase1. In Phase2, the querying node attempts to discover nodes in the surrounding key space to support data replication (Phase1 and Phase2 are named for convenience). Kad sends publish (PUBLISH REQ) and search (SEARCH REQ) requests in Phase2. This is an efficient strategy because the replica roots exist near the target, so search nodes can locate them with high probability (this is explored in detail in Section 2.3.2). The process repeats until a termination condition is reached: a specific amount of binding information is obtained or a timeout occurs. Figure 2.1 illustrates an example of a "simplified" GET lookup in which node Q searches for key T. Phase1 works as follows: 1. Learning from a routing table: Q picks ("learns") the α contacts (nodes) closest to target T from all the nodes in its routing table (although α = 3 in Kad, Figure 2.1 shows the lookup process with α = 1 for simplicity of illustration). Node X is chosen. 2. Querying the learned nodes: Q "queries" these chosen nodes (i.e., node X) in parallel by sending KADEMLIA REQ messages for T. 3.
Locating queried nodes: each queried node selects the β contacts closest to the target from its routing table and returns those contacts in a KADEMLIA RES message (β is 2 in GET and 4 in PUT). In this example, node X returns Y and Yother (not shown in the figure). Once a node sends a KADEMLIA RES in response to a KADEMLIA REQ, the node is referred to as a "located" node.

4. Learning from the queried nodes: Q "learns" the returned contacts (Y and Yother) from the queried nodes (X) and picks the α closest contacts (Y) from its learned nodes. 5. Querying next contacts: Q queries the selected nodes (Y).
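A minimal in-memory sketch of this iterative Phase1 (learn, query, locate) is given below, under simplifying assumptions: integer node IDs, a static network given as a dictionary of routing tables, and every contact alive. The function and variable names are our own, not aMule's.

```python
# Simplified simulation of Phase1 of a Kad lookup (hypothetical names).
ALPHA, BETA = 3, 2  # alpha parallel queries; beta contacts per KADEMLIA RES (GET)

def closest(candidates, target, n):
    """The n candidates with the smallest XOR distance to the target."""
    return sorted(candidates, key=lambda nid: nid ^ target)[:n]

def lookup_phase1(start_contacts, routing_tables, target):
    """Repeat learn/query/locate until no unqueried contact is closer to the
    target than the closest node already located; return the located nodes."""
    learned, located = set(start_contacts), set()
    while True:
        best = min(located, key=lambda nid: nid ^ target, default=None)
        frontier = [nid for nid in learned - located
                    if best is None or nid ^ target < best ^ target]
        if not frontier:
            return sorted(located, key=lambda nid: nid ^ target)
        for node in closest(frontier, target, ALPHA):
            located.add(node)  # the node answered our KADEMLIA REQ
            # its KADEMLIA RES carries its beta contacts closest to the target
            learned.update(closest(routing_tables.get(node, []), target, BETA))

# Toy network: node 1 is closest to target 0; the lookup converges on it.
tables = {8: [2, 3], 9: [8, 1], 2: [1, 3], 3: [2], 1: [2, 3]}
result = lookup_phase1([8, 9], tables, target=0)
assert result[0] == 1 and set(result) == {1, 2, 3, 8, 9}
```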

Figure 2.1: Illustration of a GET lookup

Q repeats these iterations (learning, querying, and locating) until it receives a KADEMLIA RES from A, which is closest to T (it cannot find any other nodes closer to T than A). In Phase2, Q sends SEARCH REQ to nodes which are close to the key, while trying to locate more nodes near T by querying already-learned nodes. Q sends SEARCH REQ to A and KADEMLIA REQ to B. After learning C, Q then sends SEARCH REQ to B and KADEMLIA REQ to C. If nodes have bindings whose key matches target T, they return the bindings. Notice that many bindings can be returned, especially for popular keywords. This process is repeated until 300 unique bindings are retrieved or 25 seconds have elapsed since the start of the search.

2.2 Evaluation of Kad Lookup Performance

Due to diverse peer behaviors and dynamic network environments, Kad stores binding information at multiple replica roots for reliability and for load balancing when storing and retrieving the information. However, without an efficient lookup, this approach can simply waste the bandwidth and storage of the nodes involved in the replication. In this section, we evaluate the performance of Kad through a measurement study, focusing on the consistency between lookups. We first describe the experimental setup of our measurements. We then measure the ability of lookups to locate replica roots and see how this ability affects Kad lookup performance.

2.2.1 Experimental Setup

We ran Kad nodes using an aMule client on machines having static IP addresses without a firewall or a NAT. Kad IDs of the peers were randomly selected so that the IDs were uniformly distributed over the Kad key space. A publishing peer shared a file named in the format "keywordU.extension" (e.g., "as3d1f0goa.zx2cv7bn"), where keywordU is a 10-byte randomly generated keyword, and extension is a fixed string shared by all our file names, used to identify our published files. This allows us to publish and search keyword objects of files that do not duplicate existing ones. For each experiment, one node published a file and 32 nodes searched for that file using keywordU. We ran nodes which had different Kad IDs and were bootstrapped from different nodes in the Kad network, to avoid measuring the performance in a particular key space. We repeated the experiments with more than 30,000 file names. In order to empirically evaluate the lookup performance, we define the following metrics. Search yield measures the fraction of replica roots found by a GET lookup following a PUT operation, indicating how "reliably" a node can search for a desired file:

search yield = (number of replica roots located by a GET lookup) / (number of published replica roots).

Search success ratio is the fraction of GET operations which retrieve a value for a key from any replica root located by a search lookup (referred to as successful searches), indicating whether a node can find a desired object at all:

search success ratio = (number of successful searches) / (total number of searches).

Search access ratio measures, for each replica root, the fraction of GET lookups which find that particular replica root, indicating how likely the replica root is to be accessed (found) through lookups with the corresponding key:

search access ratio = (number of searches which locate the replica root) / (total number of searches for the corresponding key).

For load balancing, the distribution of search access ratios among replica roots should not be skewed.
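Assuming each lookup experiment logs the set of replica roots it located, the three metrics can be computed as in the following sketch (the helper names are our own):

```python
# Each GET lookup is logged as the set of replica-root IDs it located;
# `published` is the set of replica roots stored by the PUT.
def search_yield(found, published):
    """Fraction of published replica roots located by one GET lookup."""
    return len(found & published) / len(published)

def search_success_ratio(searches, published):
    """Fraction of GET operations that located at least one replica root."""
    return sum(1 for found in searches if found & published) / len(searches)

def search_access_ratio(root, searches):
    """Fraction of GET lookups that found one particular replica root."""
    return sum(1 for found in searches if root in found) / len(searches)

published = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}   # r = 10 replica roots
searches = [{1, 2}, {1, 3}, {1, 2}, set()]     # roots found by four GETs
assert search_yield({1, 2}, published) == 0.2          # 2 of 10 roots found
assert search_success_ratio(searches, published) == 0.75
assert search_access_ratio(1, searches) == 0.75        # closest root: often found
assert search_access_ratio(7, searches) == 0.0         # far root: never found
```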

2.2.2 Performance Results

We evaluate the ability of lookups to locate replica roots by measuring the search yield. Then, we show how the search yield affects Kad lookup performance by examining the search success ratio and the search access ratio. Figure 2.2(a) shows the distribution of the search yield immediately after PUT operations ("found by each" line). The average search yield is about 18%, meaning that only one or two replica roots are found by a GET lookup (the replication factor is 10 in aMule). In addition, about 80% of all lookups locate fewer than 3 replica roots (25% search yield). This result is quite disappointing: it means that 80% of the time one cannot find a published file once these three nodes leave the network, even though 7 more replica roots exist. In Figure 2.2(b), the search yield continuously decreases over the course of a day, from 18% to 9%, which means nodes are less likely to find a desired file as time goes by. This low search yield directly implies poor Kad lookup performance. A search is successful as long as the search lookup finds at least one replica root (i.e., unless the search yield is 0), because binding information can be retrieved from any located replica root. Figure 2.2(c) shows the search success ratio over time. Immediately after publishing a file, the search success ratio is 92%, implying that in 8% of the experiments we cannot find a published file. This matches the statistics in Figure 2.2(a), where 8% of searches have a search yield of 0. This result is somewhat surprising, since we expected that i) there exist at least 10 replica roots near the target, and ii) DHT routing should guarantee finding a published file.

Figure 2.2: Performance of lookup: (a) search yield (immediately after PUT) (b) search yield (over 24 hour window) (c) search success ratio (over time) (d) search access ratio (by distance)

Even worse, the search success ratio continuously decreases over a day, from 92% to 67%, before re-publishing occurs. This degradation of the search success ratio over time is caused by churn in the network. In Kad, no other peers take over the binding information stored at a node when that node leaves the Kad network. To mitigate this problem, the publishing peer performs a PUT every 24 hours for keyword objects. Because GET lookups find only a small fraction of replica roots, there must be unused replica roots, as shown in Figure 2.2(a). In the "found by all" line, the union of all lookups finds 55% of replica roots on average, so 45% of replica roots are never located. From this fact, we can conjecture that the sets of replica roots found by individual GET lookups largely overlap. This inference is confirmed in Figure 2.2(d), which shows the search access ratio of each replica root. In this figure, nodes on the X-axis are sorted by distance to the target, and we can easily see that most lookups locate the two closest replica roots, while the other replica roots are rarely contacted. This distribution of search access ratios indicates that the load on replica roots is highly unbalanced. Overall, the current Kad lookup process cannot efficiently locate more than two replica roots. Thus, resources such as storage and network bandwidth are wasted in storing and retrieving replicated binding information.

2.3 Analysis of Poor Lookup Performance

In the previous section, we showed that the poor performance of Kad lookups (18% search yield) is due to inconsistent lookup results. In this section, we analyze the root causes of these lookup inconsistencies. Previous studies [11, 12] of Kademlia-based networks have blamed membership churn, an inherent part of every file-sharing application, as the main contributing factor to these performance issues. These studies claim that network churn leads to routing table inconsistencies as well as slow routing table convergence, and that these factors then lead to non-uniform lookup results [11, 12]. We question this claim and identify the underlying reasons for the lookup inconsistency in Kad. First, we analyze the entries within routing tables, specifically focusing on consistency and responsiveness. Next, we dissect the poor performance of Kad lookups based upon characteristics of routing table entries.

2.3.1 Characterizing Routing Table Entries

In this subsection, we empirically characterize routing table entries in Kad. We first explain the distribution of nodes in the key space, and then examine consistency and responsiveness. By consistency we mean how similar the routing tables of nodes around a target ID are, and by responsiveness we mean how well entries in the routing tables respond when searching nodes query them. Node Distribution. Kad is known to have 1.5 million concurrent nodes with IDs uniformly distributed [11]. Because we know the key space is uniformly populated and the general size of the network, we can derive n_L, the expected number of nodes that match exactly L prefix bits with the target key. Let N be the number of nodes in the network and n'_L be the expected number of nodes which match at least L prefix bits with the target key. Then, the expected prefix match between any target and the closest node to that target is log2(N) bits, and n'_L increases exponentially as L decreases (as nodes get further from the target). Thus, n'_L and n_L can be computed as follows:

n'_L = 2^(log2(N) − L),    n_L = n'_L − n'_{L+1} = 2^(log2(N) − L − 1)

When N is 1.5 million, the expected number of nodes for each matched prefix length L is as follows:

L       21     20     19     18     17     16
n_L     0.35   0.71   1.43   2.86   5.72   11.44
n'_L    0.71   1.43   2.86   5.72   11.44  22.88
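These expected counts follow directly from the two formulas above; a small sketch, assuming N = 1.5 million as in the text:

```python
import math

N = 1_500_000  # estimated Kad network size

def n_prime(L):
    """Expected number of nodes matching at least L prefix bits: 2^(log2 N - L)."""
    return 2 ** (math.log2(N) - L)

def n_exact(L):
    """Expected number matching exactly L bits: n'_L - n'_{L+1} = 2^(log2 N - L - 1)."""
    return n_prime(L) - n_prime(L + 1)

assert abs(n_prime(21) - 0.71) < 0.01     # matches the n'_L row of the table
assert round(n_exact(16), 2) == 11.44     # matches the n_L row of the table
# About 23 nodes match >= 16 bits: more than twice the replication factor r = 10.
assert round(n_prime(16)) == 23
```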

Routing Table Collection. To further study Kad, we collected the routing table entries of peers located around given targets. We built a crawler that, given a target T, crawls the Kad network looking for all the nodes close to T. If a node matches at least 16 prefix bits with T, its routing table is polled. The threshold of 16 bits is chosen empirically, since there should be about 23 nodes with a matched prefix length of at least 16 bits in Kad (more than twice the number of replica roots); those nodes are the ones "close" to T. Polling routing tables is performed by sending the same node multiple KADEMLIA REQ messages for different target IDs. Each node then returns the routing table entries that are closest to these target IDs. A node's whole routing table can thus be obtained by sending many KADEMLIA REQ messages. For every node found or polled, a HELLO REQ is sent to determine whether that node is alive. For this study, we select more than 600 random target IDs and retrieve the routing tables of approximately 10,000 distinct Kad peers. We then examine the two properties mentioned above: consistency and responsiveness. View Similarity. We measure the similarity of routing tables. Let P be the set of peers close to the target ID T. A node Z is added to P if the matched prefix length of Z with T is at least 16. We define a peer's view v of T as the set of the k closest entries in the peer's routing table, because when queried, peers select the k closest entries from their routing tables and return them. We selected 2, 4, and 10 as values of k because 2 is the number of contacts returned in SEARCH REQ, 4 in PUBLISH REQ, and 10 in FIND NODE.

We measure the distance d (or the difference) between the views (vx, vy) of two peers x and y in P as

d(vx, vy) = (|vx − vy| + |vy − vx|) / (|vx| + |vy|)

where |vx| is the number of entries in vx. d(vx, vy) is 1 when all entries are different and 0 when the views are identical. The similarity of views of the target is defined as 1 − dissimilarity, where dissimilarity is the average distance among the views of peers in P. The level of this similarity indicates how similar the close-to-T entries in the routing tables of nodes around the target T are. For simplicity, we call this the similarity of routing table entries. Figure 2.3(a) shows that the average similarity of routing table entries is 70% based on comparisons of all nodes in P. This means that between any two routing tables of nodes in P, close to T, 70% of entries are identical. Therefore, peers return similar, duplicate entries when a searching node queries them for T. The high similarity values indicate that the closest node has a view of the target similar to that of the other close nodes in P. Responsiveness. In Figure 2.3(b), we examine the number of responsive (live) contacts normalized by the total number of contacts close to a given target key. The result shows that around 80% of the entries in the routing tables respond to our requests, up to a matched prefix length of 15. The fraction of responsive contacts decreases as the matched prefix length increases because, in the current aMule/eMule implementations, peers do not check the liveness of other peers close to their Kad IDs as often as nodes further away [11].

Figure 2.3: Statistics on routing tables: (a) similarity among all nodes (b) response ratio of nodes
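The view distance d and the resulting similarity can be computed as in the following sketch, where each view is modeled as a set of node IDs (the helper names are our own):

```python
def view_distance(vx, vy):
    """d(vx, vy) = (|vx - vy| + |vy - vx|) / (|vx| + |vy|):
    1 when the views share no entries, 0 when they are identical."""
    return (len(vx - vy) + len(vy - vx)) / (len(vx) + len(vy))

def similarity(views):
    """1 - dissimilarity, where dissimilarity is the average pairwise
    distance over all views of peers near the target."""
    pairs = [(x, y) for i, x in enumerate(views) for y in views[i + 1:]]
    return 1 - sum(view_distance(x, y) for x, y in pairs) / len(pairs)

v1, v2, v3 = {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4}
assert view_distance(v1, v3) == 0.0    # identical views
assert view_distance(v1, v2) == 0.25   # one differing entry on each side
assert round(similarity([v1, v2, v3]), 2) == 0.83
```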

2.3.2 Analysis of Lookup Inconsistency

In the previous subsection, we observed that the routing table entries of nearby nodes are similar and that only half of the nodes near a specific ID are alive. From these observations, we investigate why Kad lookups are inconsistent and then present analytical results. We first explain why Kad lookups are inconsistent using an example, shown in Figure 2.4. A number (say k) in a circle means that the node is the k-th closest node to the target key T in the network; only nodes located by the querying nodes are shown. We first see how the high level of routing table similarity affects the ability to locate nodes close to T. Peers close to T have similar close-to-T contacts in their routing tables. Thus, the same contacts are returned multiple times in KADEMLIA RES messages, and the number of learned nodes is small. In Figure 2.4(a), node Q learns only the two closest nodes because all queried nodes return node 1 and node 2. The failure to locate nodes close to a target causes inconsistency between lookups for PUT and GET. A publishing node finds only a small fraction of the nodes close to the target. In Figure 2.4(b), node P locates the three closest nodes (nodes 1, 2, and 3) as well

Figure 2.4: Illustration of how a lookup can be inconsistent

as less useful nodes farther from the target T. Node P then publishes to the r "closest" nodes among these located nodes, assuming that those nodes are the very closest to the target (r = 10, but only 6 nodes are shown in the figure). Note that some replica roots (e.g., node 37) are actually far from T, and many closer nodes exist. Similarly, searching nodes (Q1 and Q2) find only a subset of the actual closest nodes. These querying nodes then send SEARCH REQ to the located nodes (referred to as "search-tried" nodes). However, only a small fraction of the search-tried nodes are replica roots (referred to as "search-found" nodes). From this example, we can clearly see that the querying nodes will obtain binding information only from the two closest nodes (node 1 and node 2) out of 10 replica roots. We next present analytical results supporting our reasoning about inconsistent Kad lookups. Figure 2.5 shows the average number of different types of nodes at each matched prefix length for PUT and GET. The "existing" line shows the number of nodes found by our crawler at each prefix length and matches the expected numbers provided in the previous subsection. The "duplicately-learned" line shows the total number of nodes learned by a searching node including duplicates, and the "uniquely-learned" line represents the number of distinct nodes found without duplicates. When a node is


Figure 2.5: Number of nodes at each distance from a target

included in 3 KADEMLIA RES messages, it is counted as 3 in the "duplicately-learned" line and as 1 in the "uniquely-learned" line. We can see that some nodes very close to T are returned in duplicate when a querying node sends KADEMLIA REQ messages. In other words, the number of "uniquely-learned" nodes is much smaller than the number of "duplicately-learned" nodes when they are very close to T. For instance, there is one existing node at a matched prefix length of 20 (in the "uniquely-learned" line), and it is returned to a querying node 5 times in PUT and 3.8 times in GET ("duplicately-learned" lines). To further compound the issue, the number of "located" nodes is half that of "uniquely-learned" nodes because, on average, 50% of the entries in the routing tables are stale; in other words, half of the learned contacts no longer exist in the network. As a result, a PUT lookup locates only 8.3 nodes and a GET lookup only 4.5 nodes out of the 23 live nodes which share more than 16 matched prefix bits with the target. Thus, we can see that duplicate contact lists and stale (dead) routing table entries cause a Kad lookup to locate only a small number of the existing nodes close to the target. Since the closest nodes are not located, PUT and GET operations are inadvertently performed far from the target. Figure 2.6 shows the average number of "published"

(denoted p_L), "search-tried" (denoted s_L), and "search-found" (denoted f_L) nodes for each matched prefix length L. We clearly see that more than half of the nodes


Figure 2.6: Number of replica roots at each distance from a target

which are “published” and “search-tried” match less than 17 bits with the target key.

We can formulate the expected number of replica roots E[f_L] located by a GET lookup for each L. Let N be the number of nodes in the network and n_L be the expected number of nodes which match L prefix bits with the target key. Then E[f_L] is computed as follows:

E[f_L] = s_L × (p_L / n_L) = s_L × p_L / 2^(log2(N) − L − 1)

The computed values of E[f_L] match the values of f_L from the experiments shown in Figure 2.6.
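The formula for E[f_L] can be evaluated directly; the sketch below (with hypothetical, identical s_L and p_L values at two distances) illustrates how sharply the expected number of found replica roots drops for nodes published far from the target:

```python
import math

N = 1_500_000  # estimated Kad network size

def expected_found(s_L, p_L, L):
    """E[f_L] = s_L * p_L / n_L, with n_L = 2^(log2 N - L - 1) nodes at distance L."""
    n_L = 2 ** (math.log2(N) - L - 1)
    return s_L * p_L / n_L

# With the same publish and search effort at both distances, a replica root
# close to the target (L = 20) is vastly more likely to be re-found by a GET
# than one published far away (L = 12), since n_L grows exponentially there.
near = expected_found(1.0, 1.0, 20)
far = expected_found(1.0, 1.0, 12)
assert near > 100 * far  # the ratio is 2^8 = 256 here
```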

From the formula, E[f_L] falls off rapidly as L decreases, because n_L grows exponentially. Thus, although a GET lookup is able to find some of the closest nodes to a target, not all of these nodes are replica roots, because a PUT operation publishes binding information to some nodes quite far from the target as well as to nodes close to it. For a GET lookup to find all the replica roots, that is, all the nodes located by the PUT, the GET operation would have to contact exactly the same nodes, which is highly unlikely. This is the reason for the lookup inconsistency between PUT and GET operations.

2.4 Improvements

We saw in Section 2.2 how the lookup inconsistency problem affects lookup performance: it limits lookup reliability and wastes resources. In this section, we describe several possible solutions to increase lookup consistency, evaluate how well the proposed solutions improve Kad lookup performance, and measure the overhead of the new improvements.

2.4.1 Solutions

Tuning Kad parameters. Tuning the parameters of Kad lookups is a first, simple attempt to improve lookup performance. The number of replica roots (r = 10) can be increased. Although this change could slightly improve performance, it remains ineffective because close nodes are still not located and replica roots far from the target will still exist. The timeout value (t = 3 seconds) for each request can also be decreased. We do not believe this will be useful either, since it results in more queries being sent and more duplicates being received. The number of contacts returned in each KADEMLIA RES can also be increased (β = 2 for GET and β = 4 for PUT). Suppose that 20 contacts are returned in each KADEMLIA RES; then 20 nodes close to a target can be located (if all contacts are alive) even though the returned contacts include duplicates. However, this increases the size of messages by a factor of 10 for GET (5 for PUT). Finally, the number of contacts queried at each iteration (α = 3) can be increased. This improves the ability to find more replica roots, but results in more messages sent and even more duplicate contacts received. Querying only the closest node (Fix1). This solution exploits the high similarity of routing table entries. After finding the node closest to a particular target, a peer asks it for its 20 contacts closest to the target. From our experimental results, a lookup finds the closest node with 90% probability, and always locates at least one node which matches at least 16 prefix bits with the target. Therefore, the expected search yield is 0.9 × 1 + 0.1 × 0.7 = 0.97 (a 90% chance of finding the closest node, from Figure 2.2(d); in the remaining 10% of cases, the located node's view still overlaps the closest node's by the 70% routing-table similarity from Section 2.3).
We note that this simple solution comes as a direct result of our measurements and analysis. Avoiding duplicates by changing target IDs (Fix2). Because of the routing table similarity, duplicate contacts are returned by queried nodes, which eventually limits the number of located nodes close to a target. To address this problem, we propose Fix2, which can locate enough of the nodes closest to a target.

Figure 2.7: Lookup algorithm for Fix2

Our new lookup algorithm is illustrated in Figure 2.7, in which peer Q attempts to locate the nodes surrounding target T. Assume that the nodes (A, B, ..., F) close to target T have the same entries around T in their routing tables and that all entries exist in the network. We extend the KADEMLIA REQ notation with a target: KADEMLIA REQ(T) is a request asking the queried node to select the β contacts closest to target T and return them in a KADEMLIA RES. In the original Kad, Q receives duplicate contacts when it sends KADEMLIA REQ(T) to multiple nodes; in a current Kad GET lookup (β = 2), only the three contacts A, B, and C would be returned. Fix2, however, can learn more contacts by manipulating the target identifiers in KADEMLIA REQ. Once the closest node A is located (i.e., Phase2 is initiated; see Section 2.1), Q sends KADEMLIA REQ with the target ID replaced by other learned node IDs ({B, C, ..., F}). In other words, Q sends KADEMLIA REQ(T′) instead of KADEMLIA REQ(T), where T′ ∈ {B, C, ..., F}. The queried nodes then return the contacts (neighbors) closest to themselves. In this way, Q can locate most of the nodes close to the "real" target T. In order to exploit Fix2 effectively, we separate the lookup procedures for PUT and GET. These operations have different requirements according to their purposes: GET requires low delay in order to satisfy users, while PUT requires publishing the file information where other peers can easily find it (and does not require low delay). However, Kad uses identical lookup algorithms for both PUT and GET, where a publishing peer starts the PUT as soon as Phase2 is initiated, even when most of the close nodes have not been located. This causes copies of bindings to be stored far from the target. Therefore, we modify only the PUT lookup, delaying PUBLISH REQ until enough nodes close to the target are located, while GET is performed without delay.
In our implementation, we wait one minute before performing a PUT operation (the average time to send the last PUBLISH REQ is 50 seconds in our experiments), expecting that most of the close nodes are located during that time.
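A minimal sketch of the Fix2 idea follows, under the same simplifying assumptions as before (integer IDs, static routing tables, our own helper names): after the close nodes are learned, each is re-queried with its own ID as the target, so its KADEMLIA RES returns its own neighbors instead of the same duplicated contacts.

```python
# Sketch of Fix2 (hypothetical names): re-query learned nodes with their own
# IDs as targets so that each KADEMLIA RES returns fresh neighbors.
BETA = 2  # contacts returned per KADEMLIA RES in a GET lookup

def closest(candidates, target, n):
    """The n candidates with the smallest XOR distance to the target."""
    return sorted(candidates, key=lambda nid: nid ^ target)[:n]

def fix2_phase2(learned, routing_tables, target):
    """For each learned node near the target, send KADEMLIA REQ(T') with
    T' set to that node's own ID; collect the neighbors it returns."""
    located = set(learned)
    for node in sorted(learned, key=lambda nid: nid ^ target):
        located.update(closest(routing_tables.get(node, []), node, BETA))
    return sorted(located, key=lambda nid: nid ^ target)

# Toy network: asking nodes 1 and 2 about themselves uncovers 3 and 4,
# which repeated KADEMLIA REQ(T) queries might never have returned.
tables = {1: [2, 3], 2: [1, 4]}
assert fix2_phase2({1, 2}, tables, target=0) == [1, 2, 3, 4]
```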

2.4.2 Performance Comparisons

We next compare the performance improvements of the proposed algorithms. Using results obtained from the same experiments described in Section 2.2, we show that our solutions significantly improve lookup performance.


Figure 2.8: Lookup Improvement (Search Yield)

Search yield clearly illustrates the lookup consistency problem. Figure 2.8 shows the search yield for each solution. Simply tuning parameters (number of replica roots, timeout value, α, β) yields search yields of 35% ∼ 42%. Fix1 achieves a search yield of 90% on average, slightly less than expected because some replica roots leave the network or do not respond to the GET requests. Fix2 improves the search yield to 80% on average, but provides more reliable and consistent results: 99% of Fix2 lookups achieve a search yield above 0.4, compared to 95% of Fix1 lookups. Since Fix1 relies only on the closest node, its lookup results may differ when the closest node changes (due to churn). This can be observed when a new node closer to the target churns in, because it could have routing table entries different from those of the other nodes close to it.

[Figure: CDF of the number of messages for Original, Fix1, Fix2, r=20, α=6; panels (a) PUT and (b) GET]

Figure 2.9: Lookup Overhead

We next look at the overhead in the number of messages sent for both PUT and GET operations. The number of messages sent by each algorithm for PUT is shown in Figure 2.9(a). Fix1 and Fix2 use 72% and 85% fewer messages, respectively, because the current Kad lookup contacts more nodes than the proposed algorithms. After reaching the node closest to a target, the current Kad lookup locates only a small fraction of “close” nodes in Phase2 (the number of nodes found within the search tolerance is fewer than 10). Thus, the querying node repeats Phase1 and contacts nodes further from the target until it can find more than 10 nodes within the search tolerance. The overhead for parameter tunings is higher than the original Kad implementation, as expected. Increasing the number of replica roots implies that 20 replica roots need to be found. Since it is already difficult (having to restart Phase1) to find 10 replica roots, it is even more difficult to find 20 – thus, the number of messages sent in PUT is much higher than for Original. Contacting more nodes at each iteration (increasing α from 3 to 6) increases the number of messages sent, and shortening the timeout (from 3 seconds to 1) incurs a similar overhead. However, this overhead is not as high as that of increasing the number of replica roots: when r is increased, Phase1 is restarted several times, since a lookup that already has difficulty locating 10 replica roots must restart Phase1 even more often to locate 20. The message overhead for GET operations is shown in Figure 2.9(b). Fix1 and Fix2 sent 1.45 ∼ 1.5 times more messages than the current Kad lookup. In the current Kad implementation, only a few contacts out of the learned nodes are queried during Phase2 – thus, few KADEMLIA REQ and SEARCH REQ messages are sent.
Even if the original Kad lookup implementation were altered to send more requests, this would not increase the search yield, because the extra messages would be wasted contacting nodes far from the target that return duplicate answers. Increasing the number of replica roots to 20 uses roughly the same number of messages as Original for GET, because the number of replica roots does not affect the search lookup process. Increasing the number of nodes queried at each iteration (α), however, does increase the number of messages sent in GET, because 6 nodes are queried instead of 3 (a shorter timeout incurs a similar overhead). The overhead due to this tweaking is even higher than that of Fix1 or Fix2, because our algorithms increase α only after finding the closest node.

[Figure: lookup performance over time for Original, Fix1, Fix2; panels (a) Search Yield and (b) Search Success Ratio]

Figure 2.10: Lookup performance over time

Fix1 and Fix2 produce much higher performance than the parameter-tuning solutions. Moreover, the overhead of these two solutions is lower than Original for PUT and only slightly higher for GET, while the overhead of the other solutions is much higher. We therefore compare only these two algorithms, Fix1 and Fix2, as they are the most promising. Figure 2.10(a) shows that the search yield of both algorithms decreases over time because of churn, but remains higher than that of the original Kad lookup. Although they show very similar performance levels, the variation in performance of Fix1 is slightly higher than that of Fix2 due to the possibility of a closer node churning in. Due to the high search yield, both Fix1 and Fix2 enable a peer to successfully find a desired object at any time with a higher probability than the original Kad lookup. In Figure 2.10(b), the search success ratios for our proposed algorithms are almost 1 right after publishing, while the ratio for the original Kad is 0.92. Even after 20 hours, the ratios for our solutions are 0.96 while the ratio for the original Kad is 0.68. Overall, Fix1 and Fix2 significantly improve the performance of the Kad lookup process with little overhead in extra messages compared to the other possible algorithms and the original one. Fix1 is simple and can be used in an environment with high routing table consistency. The downside of Fix1 is that it is not as reliable as Fix2 in some cases. Suppose a new node joins and becomes the closest node, but the entries in its routing table close to the target are not the replica roots that appeared in the routing table of the “old” closest node. Then a GET operation might not be able to find those replica roots. In Fix2, however, a querying client can locate most of the closest nodes around a target even if the “old” closest node leaves the network or a joining node becomes the closest node.
Therefore, Fix2 can be used for applications which require strong reliability and robustness.

2.5 Object Popularity and Load Balancing

Many peers publish or search popular objects (or keywords such as “love”), and the nodes responsible for those objects receive a large number of requests. To examine the severity of this load-balancing issue, we perform experiments on lookups for popular objects in the Kad network. The experiments consist of two steps: i) finding most of the replica roots of popular objects in the network using our crawler, and ii) examining the number of replica roots located by Kad lookups. We select objects whose names match keywords extracted from the 100 most popular items in Pirate Bay [37] on April 5, 2009. We modify the crawler used for collecting routing table entries so that it can send SEARCH REQ messages, and consider a node to be a replica root if it returns binding information matching a particular target keyword. We then run 420 clients to search bindings for the objects using those keywords.

[Figure: number of replica roots (existing, distinctly-found, duplicately-found) vs. matched prefix length; panels (a)–(d) as captioned below]

Figure 2.11: (a) Lookup with real popular objects (b) Original Kad lookup for our objects (c) New Kad lookup for our objects (d) Load for each prefix bit for real popular objects and our objects

We evaluate Kad lookup performance by investigating the number of replica roots located by Kad searches. First, we examine whether a client was able to retrieve bindings. In the experiments, each client could find at least one replica root and retrieve binding information – the search success ratio was 1. Next, we examine whether Kad lookups use resources efficiently. Figure 2.11(a) shows the average number of replica roots located by all clients at each matched prefix length. The “existing” line represents the actual replica roots observed by our crawler. The “distinctly-found” line counts unique replica roots, while the “duplicately-found” line includes duplicates. For example, when one replica root is located by 10 clients, it is counted as 1 in the “distinctly-found” line but as 10 in the “duplicately-found” line. Overall, our results indicate that 85% of all replica roots were never located during search lookups and therefore never provided bindings to the clients. Our crawler found a total of 598 replica roots for each keyword on average, but our clients located only 93 replica roots during the searches – only 15% of the total. Furthermore, we observed a load-balancing problem in Kad lookups. Most of the “unlocated” replica roots are far from the target (low matched prefix length): at matched prefix length 11, only 10 out of 121 replica roots were located. On the other hand, nodes close to the target were always located but received requests from many clients. At matched prefix length 20 or more (“20+” in the figure), there were only 1.4 unique replica roots (in both the “existing” and “distinctly-found” lines), implying that all those replica roots were located by clients. However, there were 201 “duplicately-found” roots, meaning that each such replica root received search requests from 141 clients on average.
To better illustrate the load-balancing problem, we define the average lookup overhead of replica roots at matched prefix length L as:

    Load_L = (number of duplicately-found replica roots) / (number of existing replica roots)

A high Load_L value means that the nodes at matched prefix length L received many search requests. The “real” line in Figure 2.11(d) shows the load for the above experiments. The load was high at high matched prefix lengths (replica roots close to the target), while it was close to 0 for nodes far from the target (low matched prefix lengths). This result indicates that i) Kad does not use replica roots efficiently, and ii) the nodes closest to the target bear the burden of most of the search requests. This problem can be explained by two factors in Kad. First, a querying node sends SEARCH REQ starting from the closest node and moving to nodes farther from the target; thus the closest node receives most of the requests. Second, due to the termination condition in Kad, the search stops once 300 results (objects) are received (recall that a replica root can return more than one result). Even though more replica roots store the binding information for a certain object, the search process stops without contacting them because 300 objects have already been returned by the few replica roots contacted. To address this load-balancing problem, we propose a new solution which satisfies the following requirements: i) balance the load of search lookups, and ii) produce a high search yield for both rare and popular objects. The solution works as follows. A querying node attempts to retrieve the binding information starting far from the target ID. Suppose that querying node Q sends a KADEMLIA REQ to node A, which is within the search tolerance for target T. In addition to returning a list of peers (containing the nodes closest to T that A knows about), A sends a piggybacked bit informing Q whether it has binding information for T, that is, whether A is a replica root for T. If A sets this bit, Q sends A a SEARCH REQ with a list of keywords, and A returns any bindings of objects matching all the keywords.
When many replica roots publish popular objects, Q has a chance to retrieve enough bindings from replica roots that are not close to T. Thus, Q does not have to contact the replica roots close to the target. This lookup can reduce the load on the nodes closest to a target with only a one-bit communication overhead. To exploit the new lookup solution, it is important to decide where to publish objects, that is, which nodes will be replica roots. Some nodes very close to a target ID should clearly be replica roots. This guarantees a high search yield even if only a small number of nodes publish the same objects (“rare” objects), because the closest nodes are almost always found, as we have previously shown. Moreover, it is desirable that nodes far from the target also be replica roots so that they can provide binding information earlier in the lookup process. This lessens the load on the closest replica roots and provides a shorter GET delay to querying nodes. In the new PUT operation, a publishing peer locates most of the closest nodes using Fix2 and obtains a node index by sorting these nodes by their distance to the target ID. The publishing node then sends the i-th closest node a PUBLISH REQ with probability p = 1/(i − 4) (taking p = 1 when i ≤ 5). This heuristic guarantees that objects are published to the five closest nodes as well as to nodes further from the target. We implemented our proposed solution and ran experiments to determine whether it met our requirements for both PUT and GET. The same experiments from Section 2.2 were performed with the new solution. We repeated the experiments while changing the number of files to be published, but we present only the experimental results similar to those of the real network when the original Kad lookups were used. We observed a search success ratio of 62% for rare objects and almost 100% for popular objects. We next examined whether our algorithm mitigated the load-balancing problem.
In the experiment, 500 nodes published about 2,150 different files with the same keyword, and another 500 nodes searched for those files with that keyword. The experiments were repeated with 50 different keywords. To show that our experiments emulated real popular objects in Kad, we tested both the original Kad lookup algorithm and our solution for comparison. In Figure 2.11(d), the “original” line shows the results obtained using the original Kad algorithm. As expected, these results were similar to what we obtained from the real network. The number of replica roots located by our proposed Kad lookup solution is shown in Figure 2.11(c); for comparison, the number located by the unmodified Kad lookup is shown in Figure 2.11(b). With our solution, more replica roots were found (in both the “duplicately-found” and “distinctly-found” lines) farther from the target than with the original Kad lookup. At matched prefix length 11, 48 out of 101 replica roots were located using our solution, while only 10 out of 91 were located using the original algorithm. The “new” line in Figure 2.11(d) shows that the load was shared more evenly across all the replica roots with our solution; at matched prefix length 20 or more, the load decreased by 22%. In summary, our experimental results show that the proposed solution guarantees a high search yield for both rare and popular objects and mitigates the load-balancing problem in lookups for popular objects.
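The PUT-side heuristic can be sketched as follows. The probability expression in the text above is partly garbled in extraction; this sketch assumes p = min(1, 1/(i − 4)), which matches the stated guarantee that the five closest nodes always become replica roots. The function name is our own.

```python
import random

def choose_replica_roots(sorted_closest_nodes, rng=random):
    """Pick replica roots from nodes already sorted by distance to the target.

    Assumption (the exact formula is garbled in the source): the i-th closest
    node (1-indexed) receives a PUBLISH_REQ with probability
    p = min(1, 1/(i - 4)), so the five closest nodes are always chosen and
    farther nodes are chosen with decreasing probability.
    """
    roots = []
    for i, node in enumerate(sorted_closest_nodes, start=1):
        p = 1.0 if i <= 5 else 1.0 / (i - 4)
        if rng.random() < p:
            roots.append(node)
    return roots
```

The decreasing tail places some replica roots far from the target, so searching peers starting far away can hit a replica root early, which is exactly what the piggybacked-bit GET lookup exploits.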

2.6 Related Work

Kad is a DHT based on the Kademlia protocol [7] that uses a different lookup strategy than other DHTs such as Chord [4] and Pastry [5]. The main difference is that Chord has a root for every key (node ID): when a querying node finds that root, it can locate most of the replica roots, since every node keeps track of its next closest node (successor). In Pastry [5], each node has an ID and the node with the ID numerically closest to the key is in charge; since each node also keeps track of its neighbors, once the closest node is found, the other replica roots can also be found. Thus, Chord and Pastry do not suffer from the same problems as Kad. We note that simply replacing the Kad algorithm with Chord or Pastry is not a suitable solution, as Kad contains intrinsic properties inherited from Kademlia that neither Chord nor Pastry possesses – for example, Kad IDs are symmetric whereas Chord IDs are not. The Pastry algorithm can also return nodes far from the target due to its switch in distance metrics. Moreover, Kad is widely used by over 1.5 million concurrent users, whereas it has never been shown that Chord or Pastry can work on networks of this scale. Since Kad is one of the largest deployed P2P networks, several studies have measured various properties and features of the Kad network. Steiner et al. [10, 38] crawled the whole Kad network, estimated the network size, and showed the distribution of node IDs over the Kad key space. More recently, in [39], the authors analyzed Kad lookup latency and proposed changing the configuration parameters (timeout, α, β) to improve it. Our work differs in that we measure lookup performance in terms of reliability and load balancing, and identify some fundamental causes of the poor performance. Stutzbach et al. [11] and Falkner et al. [12] studied networks based on the Kademlia DHT algorithm using eMule and Azureus clients, respectively.
They argued that the lookup inconsistency problem is caused by churn and slow routing table convergence. However, our detailed analysis of lookups shows that the problem is instead caused by a lookup algorithm that does not account for duplicate returns from nodes with consistent routing table views. Furthermore, those authors proposed changing the number of replica roots as a solution, while our experiments indicate that merely increasing the replication factor is not efficient. We propose two incrementally deployable algorithms which significantly improve lookup performance, and a solution to mitigate the load-balancing problem. Prior work on the lookup inconsistency is thus incomplete and limited. Freedman et al. [40] considered problems in DHTs (Kad included) due to non-transitivity in the Internet. However, non-transitivity has only a small impact on lookup performance since, in essence, it can be considered a form of churn in the network. We already accounted for churn in our analysis and showed that it is only a minor factor in the poor Kad lookup performance.

2.7 Summary

Distributed hash tables (DHTs) have been proposed as a distributed and scalable lookup service that allows users to find the content they are searching for in P2P networks or large-scale distributed systems. We have measured the performance of the Kademlia DHT lookup in Kad, which is deployed in one of the largest P2P file-sharing networks. In this chapter we addressed the following problems.

Search failure: Our measurement study shows that 8–30% of Kad lookups fail to find the desired content. This poor performance is due to the inconsistency between storing and searching lookups; only 18% of replica roots are located by searching lookups on average. We found that the Kad lookup algorithm does not work well given that routing tables around a target are much more convergent than expected. Taking this into account, we proposed two solutions: one that simply obtains lists of contacts from the node closest to the key, and another that avoids duplicate returns by asking peers about the nodes closest to themselves instead of to the target key. The new lookups can find the desired content with a probability of more than 95%.

Inefficient resource use and poor load balancing: DHTs usually store object information on multiple nodes called replica roots. When many peers search for popular information stored by many peers, 85% of replica roots are never used, and only a small number of the roots bear the burden of most requests. We found that a lookup process fails to locate most of the nodes close to a target apart from a few duplicate contacts, and stores information in places where other peers are unlikely to find it. In our solution, a publisher locates most of the closest nodes around a target and arranges replica roots so that other peers can find them before they reach the hot spots.

Chapter 3

Practical Network Coding

P2P applications for cooperative content distribution have recently gained popularity. Despite their success, they still suffer from inefficiency and reliability problems. To improve distribution speed and resolve the data availability problem, many studies have considered applying network coding to content distribution. Network coding is a data transmission technique which allows any node to encode data. In content distribution, peers no longer have to fetch a copy of each specific block; they simply ask another node to send a coded block, without specifying a block index. This technique can alleviate the block scheduling problem in a large-scale P2P system. Furthermore, peers do not suffer from the data availability problem, because they can reconstruct the original content after receiving enough blocks without relying on the existence of specific content blocks. In spite of its benefits, network coding has not been widely used in real-world P2P systems; its usefulness is still disputed because of its questionable performance gains and coding overhead in practice. In this chapter, we study the practicality of network coding by measuring the performance and overhead of network coding in a real-world application. We also provide a new encoding scheme which can make network coding more practical. Section 3.1 introduces the basic concepts and real-world performance issues in cooperative content distribution systems and linear network coding. Section 3.2 describes our practical network coding system with our novel encoding scheme, and Section 3.3 provides the results of real-world tests with our implementation. Section 3.4 discusses related work and Section 3.5 summarizes the chapter.

3.1 Preliminaries

3.1.1 Cooperative Content Distribution

As a concrete real-world application of cooperative content distribution, we consider BitTorrent [13], the most popular P2P file-sharing protocol. BitTorrent traffic has been reported to amount to 30–80% of P2P traffic and 20–55% of all Internet traffic as of 2008 and 2009 [41]. We describe the process of file distribution in BitTorrent which facilitates fast downloading. When a user wants to distribute a file, the user's client divides the file into smaller pieces. It then creates metadata (called a torrent file) which includes information such as the name of the content being shared, its total size, the hashes of the pieces, and the address of a tracker. The tracker is a central node which keeps a list of the peers participating in distribution of the file (called a swarm). A user who wants to download the content first fetches the torrent file and contacts the tracker. During this bootstrap process, the tracker responds with a list containing a subset of peers in the swarm. Peers exchange BitField and HAVE messages to indicate which pieces they have: a BitField message includes a series of bits mapped to the pieces a peer has, and peers send HAVE messages whenever they finish receiving a new piece. Based on these messages, a peer knows which nodes have its missing blocks and asks them to send those blocks.¹ According to their downloading status, peers are grouped into two types: seeders and leechers. Seeders own a complete copy of the content and share it, while leechers have either no pieces or an incomplete set, and are unable to reconstruct the content without additional pieces. For fairness, BitTorrent systems enforce a tit-for-tat exchange of content between peers. A peer typically chooses a recipient of content based on the upload rates of other nodes. This policy encourages uploading and discourages free-riding, in which peers download content without uploading data to other nodes.
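The BitField/HAVE bookkeeping described above can be sketched as follows. This is an illustrative model of the piece-availability state a client maintains, not the BitTorrent wire protocol or any real client's code.

```python
class PeerPieceMap:
    """Minimal sketch of BitTorrent piece bookkeeping (illustrative only).

    Tracks which pieces each neighbor advertises via BitField and HAVE
    messages, so a leecher can see which peers hold its missing pieces.
    """

    def __init__(self, num_pieces):
        self.num_pieces = num_pieces
        self.have = [False] * num_pieces  # our own pieces
        self.peer_pieces = {}             # peer id -> set of piece indexes

    def on_bitfield(self, peer, bits):
        # bits: one 0/1 per piece, as carried in a BitField message
        self.peer_pieces[peer] = {i for i, b in enumerate(bits) if b}

    def on_have(self, peer, index):
        # a HAVE message announces one newly completed piece
        self.peer_pieces.setdefault(peer, set()).add(index)

    def peers_with_missing(self, index):
        # whom can we ask for a piece we do not yet have?
        if self.have[index]:
            return []
        return [p for p, s in self.peer_pieces.items() if index in s]
```

This per-piece bookkeeping is precisely what network coding later removes: with coded blocks there is no piece index to track, so the "who has which piece" question disappears.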

3.1.2 Random Linear Network Coding

In this chapter, we consider the popular linear network coding design [25], which is simple to implement and has been proven to achieve maximum throughput. In this scheme, a file is divided into m pieces, each represented as n elements in a finite field F_p of prime size p. The i-th piece can then be considered as a vector ũ_i = (u_{i,1}, . . . , u_{i,n}) ∈ F_p^n, and the file becomes a sequence of vectors (ũ_1, . . . , ũ_m). When the original content source performs encoding with respect to a vector of coefficients (α_1, . . . , α_m) (referred to as the encoding vector), it computes an information vector, i.e., a linear combination of ũ_1, . . . , ũ_m:

    w̃ = Σ_{i=1}^{m} α_i ũ_i = (w_1, . . . , w_n).    (3.1)

(The choice of encoding vectors depends on the type of coding and can be a global parameter, but in random linear network coding each node independently chooses encoding vectors at random.) The source then sends the encoding vector and information vector together in an augmented block (or simply block) of the form w = (α_1, . . . , α_m, w_1, . . . , w_n). Any node, when asked for an augmented block by a peer, sends a linear combination Σ_j β_j w_j of its received blocks w_1, . . . , w_l. A receiver can decode the original file after receiving at least m linearly independent blocks. Let W be the matrix whose rows are the information vectors of received blocks and A be the matrix whose rows are the encoding vectors of received blocks. The receiver can recover all original pieces of the file U by solving the linear equation W = AU.

¹ In this thesis, a piece refers to a part of content and a block describes the data that is exchanged between peers.

3.1.3 Performance and Overhead in Network Coding

Benefits of Network Coding

Despite their advantages and popularity, cooperative content distribution systems including BitTorrent pose significant challenges. A key challenge is block scheduling, which affects distribution speed. To illustrate this problem, consider the following experiment, in which we used BitTorrent clients (CTorrent [26]) to distribute a 32MB file with 128 pieces to 95 nodes in PlanetLab [29]. All nodes joined the distribution session at the same time and limited their upload bandwidth to 100KB/s; they left the session immediately after finishing their download. In Figure 3.1(a), a point at (x, y) means that node y downloaded a block x seconds after the start of the experiment. We observe many gaps in the figure, meaning there were many small time periods when nodes did not download anything, which eventually delayed their download completion times. There were peers waiting for their turn to download missing blocks from others, and some peers endured especially long waits toward the end of their download, when they had obtained most, but not all, of the blocks. An average download completed in 634 seconds.

[Figure: per-node download progress over time; panels (a) Downloads in BitTorrent and (b) Downloads in network coding]

Figure 3.1: Comparisons of downloads between BitTorrent and network coding

Figure 3.1(b) shows that network coding makes efficient content propagation easier. In this experiment, we used the same parameters except that the BitTorrent clients used network coding.² The inherent benefit of network coding is that it does not use predetermined block indexes. Since coded blocks have no “identity,” peers no longer have to fetch a copy of each specific block; they simply request that an upstream peer send a coded block that is a combination of blocks it already has. Adding network coding dramatically reduces periods of idle time, reducing the average download time to 407 seconds.

Figure 3.2 compares the downloading times of BitTorrent and network coding with an ideal downloading time in the experiments. With a fixed upload rate, the ideal downloading time depends on the arrivals and departures of peers. In the given experiments, however, we take the ideal download to be the case where a single client stably downloads content from a server at a rate of 100KB/s without competition, and we simply compute this time by dividing the file size by the upload rate.

² More specifically, we used the optimal encoding which will be explained in the next subsection.

[Figure: ratio of download time to the ideal time for BT and NC at file sizes 32MB–256MB]

Figure 3.2: Comparisons with the optimal downloading time

The Y-axis in the figure represents the ratio of downloading time to the ideal downloading time. We observe that the downloading time of the network coding-enabled system is much closer to the ideal than that of the non-coding BitTorrent system. This means that network coding can provide near-optimal scheduling that reduces peers' downloading times.

Tradeoff in Network Coding

With the use of network coding, we face an important issue: how to encode data at each node in the network. There are many ways for a node to select and combine received blocks w_1, . . . , w_l to produce an outgoing block Σ_j β_j w_j. To evaluate them, we mainly consider two criteria: encoding overhead and block dependency. One obvious method is to use all blocks a node has and combine them using coefficients β_j chosen uniformly at random from the field F_p [25]. This is the usual random linear network coding, but in this thesis we call it full coding to distinguish it from other schemes. It has been shown [42] that full random linear network coding used in P2P systems achieves the maximum possible throughput. However, it is impractical at “line speed,” where a node must produce an encoded block in real time when requested by another node, due to CPU overhead and disk access. To reduce the encoding overhead, peers may generate a coded block from fewer input blocks. The overhead can be ameliorated by splitting the file into a small number of generations: full coding then only needs to be performed within each generation, reducing the number of blocks that must be encoded for every request. This encoding scheme is referred to as gen coding in this thesis. Another alternative to full random coding is sparse random linear network coding, where a node randomly selects up to k of the blocks it has received, w_{i_1}, . . . , w_{i_k}, forms a random linear combination Σ_{j=1}^{k} β_j w_{i_j}, and sends it to another node. The encoding scheme which combines k blocks is referred to as k-coding in this thesis. While this scheme reduces the encoding overhead, a small k generates unnecessary dependent blocks. Suppose a node has already received m′ independent blocks, and it has just received a new independent block w.
Then this node has new information about the file (namely w), but an outgoing block from the node contains w in its linear combination only with probability k/(m′ + 1). This means that when m′ is small, the new information w is propagated through outgoing blocks with high probability, but as m′ grows, especially when m′ ≈ m, this probability is only about k/m. There is thus a clear trade-off between the value of k (and the resulting CPU overhead) and the bandwidth utilization (goodput) of the network. To show the usefulness of network coding in the previous subsection, we introduced the so-called optimal coding. This coding fully combines all coefficient vectors of the blocks a peer has; because it does not encode the information vectors in blocks, its encoding overhead is negligible, yet it has the same low block dependency as full coding. We consider this coding to give the best performance obtainable in practice, and we use it to compare the performance of the other schemes. Is there an encoding scheme that satisfies both requirements of low encoding overhead and low block dependency? Figure 3.3 demonstrates the trade-off between these requirements by comparing sparse coding schemes with various values of k, full coding, and gen coding. White bars represent CPU usage and crosshatched bars represent the relative number of dependent blocks produced compared to optimal coding. Note that k-coding with a small k generates many dependent blocks, while k-coding with a large k and full coding demand more CPU cycles.³ The graph also includes our novel network coding scheme, i-Code, which combines the benefits of sparse coding (small overhead) and full coding (few dependent blocks).
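The k/(m′ + 1) figure follows from sampling k of the m′ + 1 held blocks uniformly at random, so the chance that the specific new block w is among them is k/(m′ + 1). A quick simulation of that sampling model (illustrative only, not the thesis code) reproduces it:

```python
import random

def inclusion_rate(m_prime, k, trials=20000, rng=None):
    """Empirical probability that a specific new block w is among the k
    blocks sampled uniformly (without replacement) from the m'+1 blocks a
    node holds -- i.e., that w influences an outgoing k-coded block."""
    rng = rng or random.Random(42)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(range(m_prime + 1), k)  # index m_prime plays "w"
        hits += m_prime in sample
    return hits / trials
```

With m′ = 99 and k = 8 the rate is close to 8/100 = 0.08, while with m′ = 9 it is close to 8/10 = 0.8, illustrating how the propagation probability of new information collapses as a node accumulates blocks.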

³ In Section 3.3, we give a more detailed explanation of this graph.

[Figure: CPU usage (white bars, %) and ratio of dependent blocks (crosshatched bars) for k-coding (k = 1 … 256), full coding, gen coding, and iCode]

Figure 3.3: Tradeoff between CPU overhead and block dependency

We will describe this lightweight and efficient encoding scheme in the next section.

3.2 Practical Network Coding System

Network coding has been proposed to improve the performance of content distribution systems. However, almost no real-world system uses network coding for large-scale content distribution over peer-to-peer networks. One of the obstacles to the use of network coding in practice is encoding overhead, because nodes in the network need to combine several blocks to send encoded data. This encoding overhead becomes more serious when we consider the resource constraints placed on real-world clients, which have limited memory, limited processing capacity, and slow disks. These factors affect the overall performance of network coding operations. By taking these constraints into account, we propose a practical network coding system with a lightweight and efficient coding scheme, which achieves both low encoding overhead and a low level of block dependency.

3.2.1 System Architecture

Here we provide an overview of our network coding-enabled system for content distribution. Figure 3.4 shows the modules in each peer participating in a distribution session. With the help of the neighbor manager module, peers maintain an overlay network for content distribution and obtain information on the data others have. Using the scheduler module, a peer decides which nodes it uploads to or downloads from. Peers exchange encoded data in the form of augmented vectors and test the usefulness of received blocks using the dependency checker. The encoding module generates outgoing blocks and the decoding module reconstructs the original content. In our system, the majority of blocks a peer has reside on the peer's disk and must be read into memory as needed, to respect resource constraints. This is because peers may not have enough memory to cache all blocks in RAM, especially in modern P2P networks where multi-gigabyte files are common; a compact memory footprint is a common requirement in all systems. Below we describe the functions of each module.

Figure 3.4: System architecture

The neighbor manager module provides lists of peers participating in a distribution session. These lists can be obtained from a tracker or other peers. A peer establishes links to some peers (called neighbors) and maintains connections by periodically exchanging messages. Therefore, this module can detect arrivals and departures of the neighbors. The neighbor manager also provides information about what each peer has. In BitTorrent-like systems, peers exchange bitfields, series of bits mapped to the pieces each peer has. However, network coding does not use predetermined block indexes, so peers instead exchange the ranks of their coefficient matrices, which are composed of the encoding vectors of the peers' blocks.

In P2P systems not using network coding, clients must make decisions on block scheduling. With network coding, however, a peer does not have to decide which specific blocks to download, which is an inherent benefit of network coding. Instead, the scheduler module only decides the neighbors with whom a peer exchanges data. To select those neighbors, the scheduler infers whether a neighbor has useful blocks or not. We use a simple heuristic that assumes a neighbor has useful blocks unless it sends multiple useless blocks. The peer asks such a neighbor not to send blocks until the neighbor receives new independent blocks from other nodes. The scheduler module also determines to whom a peer uploads data; we simply follow the tit-for-tat policy of BitTorrent-like systems.

In linear network coding, the usefulness of data is determined by linear dependency. The dependency checker module maintains a coefficient matrix which consists of the encoding vectors of the blocks in a peer's local store. When peer A is about to receive a block from peer B, it first performs a linearity check using the encoding vector and the coefficient matrix. The module uses Gaussian elimination to verify whether the block is linearly independent of the blocks A already has. If it is dependent, the received block will be dropped.
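The independence test at the heart of the dependency checker can be sketched as follows. This is a minimal illustration assuming GF(2^8) arithmetic with the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) — the thesis fixes the field but not the polynomial — and all names are hypothetical:

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) by shift-and-add with modular reduction (0x11B)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p

def gf_inv(a):
    """Inverse via a^254 = a^(-1) in the 255-element multiplicative group."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

class DependencyChecker:
    """Keeps received encoding vectors in incrementally reduced form; a new
    vector is useful iff it does not reduce to zero against the stored rows."""

    def __init__(self):
        self.rows = []    # reduced vectors, in insertion order
        self.pivots = []  # leading nonzero index of each stored row

    def _reduce(self, vec):
        # Gaussian elimination of vec against each stored row's pivot column.
        v = list(vec)
        for row, p in zip(self.rows, self.pivots):
            if v[p]:
                c = gf_mul(v[p], gf_inv(row[p]))
                v = [x ^ gf_mul(c, y) for x, y in zip(v, row)]
        return v

    def add_if_independent(self, vec):
        """Keep vec only if some coefficient survives reduction, i.e. the
        corresponding block carries new information; otherwise drop it."""
        v = self._reduce(vec)
        if not any(v):
            return False  # linearly dependent
        self.rows.append(v)
        self.pivots.append(next(i for i, x in enumerate(v) if x))
        return True
```

Because only the m-byte encoding vector is needed for this test, it can be run before the multi-hundred-kilobyte information vector is ever transmitted, which is exactly the handshake described next.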
To avoid wasting resources on the transmission of linearly dependent blocks, we use the following dependency check strategy. Once peer B decides to upload a block to A, peer B first encodes a new block and sends A only the encoding vector. Upon receiving the encoding vector, peer A checks whether the vector is linearly independent of the blocks it has. If it is, A requests B to send the full block and B transmits the remaining part (i.e., the information vector) of the block. If not, A informs B that the coded block is dependent. This strategy greatly reduces wasted bandwidth since an encoding vector is much smaller than an entire coded block.

The decoding module enables a peer to reconstruct the original content upon receiving m independent blocks, where m is the number of pieces in the content. This module solves a linear equation for decoding by using Gauss-Jordan elimination. Because decoding is time-consuming work, we may consider progressive decoding: a peer decodes blocks as it receives them, without waiting to receive all m independent blocks. With this strategy, the decoding time overlaps with the time to download the content, reducing the total time to obtain the original content. However, this strategy is not feasible in practice due to the previously mentioned resource constraints: the majority of blocks a peer has are not allowed to reside in memory but rather are loaded into memory from the disk as needed, which requires a large number of disk operations. Instead, we consider another approach with multiple generations. Let m be the number of blocks and n be the dimension of blocks. Because the decoding process requires O(m²n) multiplication operations, the decoding time can be reduced to 1/x of that with 1 generation by using x generations. Furthermore, once a node downloads one generation, it can start decoding that generation while receiving blocks in other generations.
Therefore, decoding time overlaps with downloading time, as in progressive decoding. Finally, the encoding module generates outgoing blocks using the random linear network coding described in Section 3.1. A peer chooses a subset of blocks and linearly combines them with randomly selected coefficients in the Galois field GF(2^8). This module may have several submodules for different types of encoding schemes such as full coding, sparse coding, and gen coding, as well as our novel encoding scheme, whose details are described in the next subsection.
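The encoding module's core operation, a random linear combination of augmented blocks over GF(2^8), can be sketched as follows (a simplified illustration with hypothetical names; the reduction polynomial 0x11B is an assumption):

```python
import random

def gf_mul(a, b):
    """GF(2^8) multiply (reduction polynomial 0x11B); field addition is XOR."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p

def encode(blocks, k):
    """Randomly combine k of the peer's augmented blocks (coefficient vector
    followed by information vector). k = len(blocks) corresponds to full
    coding; a small k corresponds to sparse k-coding."""
    chosen = random.sample(blocks, min(k, len(blocks)))
    out = [0] * len(chosen[0])
    for blk in chosen:
        c = random.randrange(1, 256)  # random nonzero coefficient
        out = [o ^ gf_mul(c, x) for o, x in zip(out, blk)]
    return out

# A source holds m original pieces; piece i is augmented with unit vector e_i.
m, size = 4, 8  # toy sizes for illustration
originals = [[1 if j == i else 0 for j in range(m)]
             + [random.randrange(256) for _ in range(size)] for i in range(m)]
coded = encode(originals, k=2)  # a sparse k-2 coded block
```

Because the coefficient vector rides along with the payload, any receiver can run the dependency check and, eventually, decoding without knowing which pieces were combined.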

3.2.2 i-Code: Lightweight and Efficient Coding

We propose a novel encoding scheme which we call i-Code, a contraction of 'incremental encoding.' This scheme is lightweight and efficient: it requires only one block to be read from the disk and mixes only two blocks for every encoding operation. Although this scheme combines a small number of blocks, it does not impose the dependent-block penalty faced by sparse encoding. The key idea behind i-Code is to approximate full encoding by maintaining a block of "well-mixed" encoded data called the accumulation block, which contains a mix of the blocks in the peer's local store (see Figure 3.5). When peer A is about to receive a block, it first performs a linearity check, using the encoding vector, to verify that the incoming block is linearly independent of the blocks that A already has. If it is not, then the block will be dropped. Otherwise, the newly

[Diagram: a new block passes the dependency check, is stored on the local disk, and is mixed into the accumulation block.]

Figure 3.5: i-Code design. Our encoding scheme requires only one block to be read from disk and one linear combination, greatly reducing encoding overhead.

received linearly independent block w is stored to the disk, and the accumulation block a is updated by a ← αa + βw, for randomly chosen coefficients α, β ∈ GF(2^8). When sending a block, A reads a single block wi from the disk and computes αa + βwi, which is a random linear combination of the accumulation block and wi. It also updates a similarly, as a random linear combination with wi. In i-Code, the blocks a peer has are "accumulated" into this "representative" block. Mathematically, the accumulation block is updated to be a 'generic' random block of the subspace spanned by the blocks the peer has. When outputting a coded block, the peer selects one block and combines it with the accumulation block. Because many blocks are already combined into the accumulation block, mixing these two blocks has an effect similar to full coding, which combines all blocks a peer has. As one example of this effect, a newly received block can immediately contribute to any outgoing block, because it has already been linearly combined into the accumulation block. This reduces the probability of encoding dependent blocks. In sparse k-coding, on the other hand, a newly received block may not be used when encoding a new outgoing block, so dependent blocks may be encoded with higher probability.

i-Code has strategies for choosing blocks from the disk in a way that reduces the probability of generating dependent blocks. Suppose that peer A is sending blocks to other nodes including peer B. First, the peer chooses a block which is not yet accumulated into the accumulation block. When such a block is linearly combined with the accumulation block for encoding, the outgoing block is likely to be independent of the blocks A has already generated or the blocks B has. To record whether a block has been accumulated into the accumulation block or not, i-Code uses a small data structure like the bitfield in BitTorrent. The overhead needed to store this information is almost negligible.
Second, the choice of wi depends on the recipient B and the previous history: A chooses a random block wi which was not received from B and was not already used to produce an output block sent to B, in order to avoid sending a linearly dependent block as much as possible. This information is also stored in a bitfield-like structure. Because BitTorrent clients already maintain bitfields to record which pieces neighbors have, i-Code does not require additional memory compared to BitTorrent. The difference is that i-Code uses this record to decide which block to upload instead of deciding which pieces to download. Even using i-Code, peer A may encode and send dependent blocks to peer B in some cases. Let VA and VB denote the subspaces spanned by the blocks A and B have, respectively.

In the first case, VA may be a subset of VB. Then any coding scheme, including full coding, generates blocks which are dependent on the blocks peer B has. The second case requires that two conditions are met at the same time: i) after A sends a block to B, no new independent block has been combined into the accumulation block, and ii) A picks a block which belongs to VB. This case is very rare in real-world applications, where each peer keeps downloading blocks from many neighbors and uploading blocks to other nodes; the chance of consecutively sending to the same peer without updating the accumulation block is low. Moreover, we also keep the history of blocks sent to and received from B to reduce the probability of unnecessary dependent blocks as much as possible. In practice, this results in a very low level of dependent blocks, comparable to full coding, as we will empirically show in the next section. Our proposed design has the following benefits:

• Low disk access overhead: since only one block must be read from disk per encoding, the proposed scheme requires a much smaller amount of disk access, and is thus faster.

• Low computational overhead: a peer needs to perform only one linear combination per encoding operation.

• Low linear dependency: the encoding scheme achieves dependent block levels com- parable to the full coding.
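The accumulation-block mechanics described above can be sketched as follows. This is a simplified model: received blocks are folded into the accumulation block immediately, the reduction polynomial 0x11B is an assumption, and all names are hypothetical, not from our implementation.

```python
import random

def gf_mul(a, b):
    """GF(2^8) multiply (polynomial 0x11B); field addition is XOR."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p

def combine(x, y, alpha, beta):
    """alpha*x + beta*y over GF(2^8), elementwise on augmented vectors."""
    return [gf_mul(alpha, a) ^ gf_mul(beta, b) for a, b in zip(x, y)]

def rand_coeff():
    return random.randrange(1, 256)

class ICodePeer:
    """i-Code state: a store of blocks (the 'disk'), the accumulation block,
    and a bitfield-like flag per block: mixed into the accumulation yet?"""

    def __init__(self, length):
        self.acc = [0] * length  # accumulation block (augmented vector)
        self.store = []          # blocks on disk
        self.mixed = []          # has block i been folded into acc yet?

    def load_from_disk(self, block):
        # e.g. a seeder's original pieces: on disk, not yet accumulated
        self.store.append(list(block))
        self.mixed.append(False)

    def receive(self, block):
        # caller has already verified independence; store and fold into acc
        self.store.append(list(block))
        self.mixed.append(True)
        self.acc = combine(self.acc, block, rand_coeff(), rand_coeff())

    def send(self):
        # prefer a block not yet accumulated, per the first strategy above
        i = (self.mixed.index(False) if False in self.mixed
             else random.randrange(len(self.store)))
        w = self.store[i]
        out = combine(self.acc, w, rand_coeff(), rand_coeff())   # αa + βw_i
        self.acc = combine(self.acc, w, rand_coeff(), rand_coeff())  # a ← αa + βw_i
        self.mixed[i] = True
        return out
```

Each send thus reads one block from disk and performs one linear combination, while the accumulation block keeps the output "well mixed" across everything the peer holds.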

3.3 Evaluation

Our goal is to test the practicality of network coding schemes and compare the performance and overhead of network coding in real-world applications. As such an application, we chose BitTorrent and implemented a network coding-enabled BitTorrent system by modifying Enhanced CTorrent4 to use network coding schemes. With this system, we show the effectiveness of our proposed encoding scheme through empirical experiments. Specifically, we present how our encoding scheme meets the requirements of low encoding overhead and a low level of block dependency. We then explore how much network coding improves or degrades performance in real-world content distribution. Here we summarize the experimental environments. We performed experiments in two networks: our local campus network and a wide-area network using PlanetLab. We mainly tested the network coding system by distributing files to 100 – 200 nodes in PlanetLab. Nodes in this wide-area network were not reliable and sometimes showed unexpected behavior, which is common in P2P networks. Therefore, we also measured the performance and overhead of the system in our local testbed and compared the results with those from the PlanetLab experiments. The local testbed was composed of 30 nodes equipped with a 1 GHz Dual-Core AMD Opteron processor and 2GB DDR2 RAM.

3.3.1 Comparisons of encoding schemes

To evaluate our encoding scheme i-Code, we are mainly concerned with downloading time, encoding overhead, and linear dependency. With these metrics, we compare its performance and overhead with existing schemes: full encoding, sparse encoding, and gen coding, which uses full encoding with multiple generations. The sparse

4 A simple BitTorrent client written in C++ [26]

encoding which combines x blocks is denoted as k-x coding, and gen coding using x generations is represented as gen-x. For the purpose of comparison with other studies, we followed the same experimental setup used in [18]: a 100KB/s upload bandwidth limit and a 256KB piece size.

Tradeoff revisited

We already presented the overhead and the linear dependency of blocks introduced by each encoding scheme in Figure 3.3 of Section 3.1. The figure clearly showed that i-Code satisfies both requirements of low overhead and a low level of block dependency while the other schemes cannot. We now present the same result from different views to discuss the details.

[Bar charts: (a) time overhead in seconds for seeder and leecher, and (b) message overhead in MB, for each encoding scheme (k-1 through k-256, full, gen, i-Code, optimal).]

Figure 3.6: Comparison of overhead

Figure 3.6(a) shows the encoding time introduced by each encoding scheme. The results were obtained from experiments distributing a 128MB file where all nodes joined the network at the same time and left immediately after completing the download. In sparse coding schemes, encoding overhead increased as the number of combined blocks increased. Full coding spent much time combining blocks. In particular, a seeder combined 512 blocks whenever it transmitted an outgoing block; thus the seeder spent 665 seconds on encoding while the average downloading time was 1683 seconds. Instead of presenting the number of dependent blocks again, we provide the actual message overhead due to block dependency in Figure 3.6(b). The message overhead was proportional to the number of encoding vectors which were dependent on the blocks a receiver had. Within the BitTorrent protocol, we counted the bytes of the following transmitted messages: PIECE packets including dependent coefficients and CANCEL packets reporting that a receiver does not want a received block. The sizes of these packets were 591 bytes and 93 bytes, respectively, for a 128MB file consisting of 512 blocks. As the number of combined blocks decreased, the message overhead increased. In k-2 coding, a peer had difficulty obtaining independent blocks, receiving 5MB of messages containing useless coefficients. We observed two exceptions: gen coding and full coding. Gen coding had the smallest message overhead although it had more dependent encoding vectors than other schemes. In this experiment, we used 8 generations as in [19]; thus the size of an encoding vector was 1/8 that of other schemes, which led to small message overhead. In full coding, the linear dependency of blocks was higher than in some sparse codings, such as mixing only 64 blocks. This was because the seeder could not provide "fresh" blocks to the network in time, so the number of independent blocks leechers had did not increase quickly. The leechers then tried to download blocks from among these "stale" blocks, so the scheme produced many dependent blocks.

[Bar chart: downloading time in seconds for each encoding scheme (k-1 through k-256, full, gen, i-Code, optimal).]

Figure 3.7: Comparison of downloading time

Figure 3.7 shows the downloading time achieved with each encoding scheme. From the above results on computational overhead and linear dependency, it was obvious that full coding, and sparse coding with a small number of combined blocks, delayed the downloading time of peers. We therefore chose the schemes gen-8 [19], k-10 [18], and i-Code, and measured their overhead according to inter-arrival times.

Detailed analysis on schemes

[Stacked bar chart: time in seconds spent on payload encoding, coefficient encoding, and dependency checking for k-10, gen-8, and i-Code at inter-arrival times of 0, 15, and 30 seconds.]

Figure 3.8: Detailed comparison of time overhead

To encode a 128MB file consisting of 512 blocks, a seeder combines 64 blocks with gen-8 coding and 10 blocks with k-10 coding. Because i-Code approximately has the encoding cost of combining two blocks, i-Code's encoding overhead in a seeder is 1/64 of that of gen-8 coding and 1/10 of that of k-10 coding. Figure 3.8 shows that a leecher with i-Code suffered the least computational overhead in total. The overall computational overhead of i-Code was 10% of gen-8 coding and 26% of k-10 coding on average, for all inter-arrival times. On the other hand, gen-8 coding had smaller overhead in checking block linearity. Block dependency is determined by performing Gaussian elimination, whose overhead is O(m² × n), where m is the number of blocks and n is the dimension of a block. By using multiple generations, gen coding can reduce m, which reduces the overhead of dependency checking. However, the dominant overhead is the time to combine information vectors, and i-Code has much smaller encoding overhead because it combines only two blocks: one from the disk and the accumulation block. i-Code also satisfies the requirement of a low level of block dependency. Figure 3.9(a) shows the number of received dependent encoding vectors according to inter-arrival times. i-Code produced a smaller number of dependent encoding vectors than the k-10 or gen-8 encoding schemes; its block dependency was similar to that of optimal coding. When nodes joined the network at different times, i-Code induced almost no dependent encoding vectors. Figure 3.9(b) shows the number of dependent encoding vectors when

[Plots: number of dependent encoding vectors for k-10, gen-8, i-Code, and optimal coding, (a) by inter-arrival time and (b) by joining order.]

Figure 3.9: Comparison of block dependency

peers joined every 5 seconds. Peers using i-Code did not generate dependent blocks except when they joined the network at the very beginning of a session. At the beginning of the session, the source had not yet transmitted many blocks into the network; accordingly, peers formed cycles and tried to download blocks which were dependent on what they had. However, peers using k-10 produced 600 – 1000 dependent encoding vectors regardless of when they joined the network.

[Plot: downloading time in seconds versus inter-arrival time (2.5, 5, 7.5 seconds) for k-10, gen, i-Code, and optimal coding.]

Figure 3.10: Detailed comparison of time overhead

In sum, i-Code had overhead equivalent to that of k-2, yet its block dependency was close to that of optimal encoding. With smaller overhead, i-Code therefore provided shorter downloading times than the other encoding schemes, as shown in Figure 3.10.

3.3.2 Practicality Check

In this section, we answer how much real-world applications benefit from network coding. To evaluate the performance of network coding in a real-world application, we compared the performance measured in a non-coding system and a network coding-enabled system. For simplicity, let BT and NC denote the non-coding system and the coding system, respectively. As the non-coding system, we again used CTorrent, a real-world BitTorrent client. As the network coding-enabled system, we used iTorrent, CTorrent using network coding equipped with i-Code. We conducted experiments mainly using two experimental setups: one for comparing results with previous studies and the other as a reality check for practical content distribution applications.

Simple comparison with the same block size

[Bar charts: downloading time in seconds versus file size for BT, NC, and 1:1 downloading, (a) in the local testbed (32MB – 1024MB) and (b) on PlanetLab (32MB – 256MB).]

Figure 3.11: Comparisons of downloading times with 256KB block size

We first compared the performance of the content distribution systems with the experimental setup used in [18]; the block size was 256KB and upload bandwidth was limited to 100KB/s. Figure 3.11 presents the downloading finish times when peers joined a distribution session at the same time and left the session immediately after completing the download. We also include "1:1" downloading as an ideal case in which a client directly downloads content from a server without any interference from other nodes. Overall, NC reduced the downloading time by about 18 – 40% compared to BT. As described in Section 3.1, finding the optimal data propagation schedule that minimizes downloading time in a distributed way is very difficult. However, the result shows that NC provided near-optimal downloading time. This was especially true when the file size was large. When many nodes joined the session at the same time, most nodes had to wait to download their initial blocks from the source or other peers. The proportion of this initial waiting time in the total downloading time became smaller as the content size became larger. Therefore, we observed that the downloading time of NC approached the ideal finish time as the file size increased. Figure 3.12 compares the downloading times measured in various environments. The "LT" label represents our local testbed and the "PL" label represents the PlanetLab environment. For illustration, the Y-axis in the figure shows the ratio of downloading times to the 1:1 downloading time. Notice that the difference between the downloading times of BT and NC was larger in PlanetLab, where nodes showed more dynamic and unexpected behavior.

Setting up experimental parameters

[Plot: ratio of downloading time to the 1:1 downloading time versus file size for BT-PL, BT-LT, NC-PL, and NC-LT.]

Figure 3.12: Comparisons of downloading times in different environments with 256KB block size

We next considered a new experimental setup for content distribution systems in practical use. We first set parameter values for experiments based on our empirical tests and previous studies. For our new setup, the upload bandwidth was limited to 80KB/s, the median upload capacity measured in [43]. Each peer was connected to at most 50 neighbor nodes for exchanging messages, which is the default value in CTorrent [26]. Previous studies [19, 18] used a 256KB block size to distribute files with sizes between 32MB and 512MB for both coding and non-coding systems. However, we empirically selected 64KB for the non-coding system and 128KB for the coding system to distribute a 128MB file. We now describe how we determined the block size. The block size largely affects the performance of content distribution systems, but determining an appropriate block size is not trivial due to tradeoffs. A large block size delays downloading because blocks are propagated slowly and each peer has difficulty getting the first several blocks at the beginning of a distribution session. A small block size introduces more overhead than a large block size. In BitTorrent, the sizes of the metadata file and the BitField message are proportional to the number of blocks. Each time a piece is received, a HAVE message is sent to all neighbors to let them know the peer has that block: the more blocks, the more bandwidth consumed. In network coding-enabled systems, if the block size is small, the encoding vectors will be large, which leads to more message overhead. In addition, a small block size also results in a large coefficient matrix, and thus the overhead for dependency checking and decoding becomes large.
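As a back-of-the-envelope illustration of the HAVE-message component of this tradeoff: 9 bytes is the standard BitTorrent HAVE wire size (4-byte length prefix, 1-byte message id, 4-byte piece index), while the neighbor count and file size below are simply the values of our setup; the function itself is hypothetical.

```python
MB = 1024 * 1024

def have_overhead_bytes(file_bytes, block_bytes, neighbors, have_size=9):
    """Rough per-peer HAVE traffic if one 9-byte HAVE message goes to every
    neighbor for each received block. Illustrative accounting only."""
    blocks = file_bytes // block_bytes
    return blocks * neighbors * have_size

# Doubling the block size (64KB -> 128KB for a 128MB file) halves this
# component of the message overhead.
small = have_overhead_bytes(128 * MB, 64 * 1024, neighbors=50)
large = have_overhead_bytes(128 * MB, 128 * 1024, neighbors=50)
```

The same linear-in-block-count scaling applies to the metadata file and BitField sizes mentioned above.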

[Plot: downloading time in seconds versus block size (32KB – 1024KB) for BT and NC.]

Figure 3.13: Impact of a block size on downloading times

Figure 3.13 shows the downloading times corresponding to the block sizes. We repeated the experiments conducted in [18], where the file size was 64MB and the block size varied from 32KB to 1024KB. Unlike [18], where both systems took minimal downloading time with a block size of 128KB or 256KB, we did not observe a "sweet spot" of block sizes in our experiments. In both systems, the downloading time generally increased as the block size increased; the benefit of a small block size outweighed the overhead it introduced. However, the difference caused by block size became smaller when the inter-arrival time of peers was large, which is not shown in the figure. For a similar reason, Vuze [44] (previously called Azureus) recommends around 1000-1500 pieces per file in [45]. Considering the described tradeoffs introduced by the block size, we repeated experiments to find an appropriate block size and selected 2048 pieces for the non-coding system and 1024 blocks for the coding system for the 128MB file.

Performance comparison

[Plot: CDF of downloading times for NC and BT.]

Figure 3.14: Downloading times with flash-crowd

With the new experimental setup, we conducted experiments and compared downloading times. Because we obtained similar experimental results across runs, we mainly present the results obtained by distributing 128MB content to 100 PlanetLab nodes. Figure 3.14 shows the downloading finish times when peers joined the distribution session at the same time and left the session immediately after finishing downloading (flash crowd). With the new setup, the coding system does not provide as much performance improvement as with the old setup of [19, 18]: NC reduces the downloading time by about 13 – 17% compared to BT with the new setup, versus 18 – 40% with the old setup. We may explain this discrepancy using the result shown in Figure 3.13; the performance improvement due to network coding becomes higher as the block size increases. We then investigated the impact of peer arrivals and departures on performance. In Figure 3.15(a), peers arrived at different times and left the session as soon as they finished downloading. The figure shows the average downloading times for different inter-arrival times. Network coding provided shorter downloading times, but the improvement became smaller as the inter-arrival time increased. Peers could get data relatively easily and

[Plots: downloading time in seconds for BT and NC, (a) versus inter-arrival time and (b) versus seeding time.]

Figure 3.15: Downloading times according to peer arrivals and departures

scheduling became easier because there were already enough blocks in the network when peers joined the network at different times. In Figure 3.15(b), peers arrived every 15 seconds and stayed for a given seeding time after they completed downloading. When there were many seeders in the session, peers could easily download from those seeders and were less likely to suffer from the scheduling problem or the rare-blocks problem. NC provided an improvement of 1.5 – 3.2%, and the performance difference between the two systems decreased as the seeding time increased.

[Bar chart: percentage of completed downloads for BT and NC versus source departure time (1800 – 3000 seconds).]

Figure 3.16: Completeness according to source departure time

Robustness comparison

Network coding provides much better robustness than the non-coding system in several situations, including node departures, a limited number of neighbors, and heterogeneous capacities. During a distribution session, some content blocks may become rare, and peers missing those blocks need to wait a long time for their turn to receive them. Even worse, those blocks may disappear from the network entirely, leaving peers unable to reconstruct the original file. This problem may occur because of uneven data propagation and the dynamics of nodes, including the source. To introduce this problem into the distribution session, we conducted experiments similar to those in [15, 18]. The source sent out data for a given time and then left the session. Peers arrived every 30 seconds and left the session immediately after finishing downloading. Figure 3.16 shows the fraction of nodes which completed downloading. Recall that it took 1638 seconds, on average, to send the whole content at an upload rate of 80KB/s. When the source left the session at 1800 seconds, it had had enough time to send the entire content; however, we did not find any nodes which completed downloading with this departure time. As the source transmitted data for a longer time, the problem was gradually ameliorated. In contrast, the network coding system did not rely on specific blocks of data and did not suffer from this problem: more than 95% of the nodes finished downloading regardless of the departure time.

[Plot: CDF of downloading times of fast nodes for NC and BT.]

Figure 3.17: Downloading times of fast nodes in heterogeneous environments

In P2P content distribution systems, nodes generally have non-homogeneous capabilities. We studied the performance of content distribution systems in heterogeneous environments. For these experiments, we simply grouped 200 PlanetLab nodes into two parties: 100 fast nodes with a 420KB/s maximum upload rate and a 4MB/s maximum download rate, and 100 slow nodes with an 80KB/s maximum upload rate and a 1MB/s maximum download rate. A source belonging to the fast group distributed a file to the 200 nodes. Figure 3.17 shows the downloading finish times experienced by fast peers when they interacted with the slow nodes. We observed that with NC the downloading times of the fast nodes were 37% better than with BT on average. Without network coding, it took a long time for fast nodes to download their missing blocks, especially when slow nodes had those blocks. In contrast, with network coding, the blocks that propagate in the network are linear combinations of many other blocks. Thus, fast nodes had better chances of finding useful blocks from other fast neighbors and making quick downloading progress.

[Plots: CDFs of downloading times, (a) BT-50 vs. BT-20 and (b) NC-50 vs. NC-20.]

Figure 3.18: Impact of number of neighbors

We also examined the impact of the number of neighbors connected to each peer. In a distribution session, peers form an overlay network and exchange information on peers or the data pieces they have. If a peer is connected to too many neighbors, it consumes too many resources maintaining connections and exchanging messages such as "keep-alive" and bitfield-like information. Thus, peers commonly limit the number of connected neighbors. In [15, 19, 18] each peer is connected to at most 10 – 20 neighbors, while the default maximum numbers of neighbors in CTorrent and Mainline 5.3 [26, 9] are 50 and 80, respectively. Figure 3.18 shows the download finish times measured in experiments with different maximum numbers of neighbors. Each peer in BT was connected to at most 50 neighbors in the "BT-50" experiments and 20 neighbors in the "BT-20" experiments. In BT-20, peers exchanged messages with fewer neighbors than in BT-50 and thus had limited information on which blocks other peers in the network had. Because they had to perform block scheduling based on this limited information, their downloading times were longer than in BT-50. However, the coding system does not suffer from the block scheduling problem because it does not have to schedule which blocks to download. NC showed shorter downloading times than BT regardless of the number of neighbors, as shown by the NC-20 and NC-50 lines.

Overhead and system parameters

We examined how much overhead network coding introduces in order to provide such improved performance and robustness. Recall that it took 1638 seconds, on average, to download the 128MB content from other nodes with an 80KB/s upload rate using network coding. The total average time to encode outgoing blocks in each peer was less than 10 seconds (less than 1% of the downloading time), including the disk access time. Message overhead for delivering coefficient vectors was about 1MB, 0.78% of the content. When peers joined the session every 15 seconds, the additional average message overhead for delivering dependent encoding vectors was only about 6510 bytes per node. However, NC saved far more bandwidth than BT. Recall that the 128MB file was divided into 2048 and 1024 blocks in BT and NC, respectively. More blocks mean more message overhead, because each client sent its neighbors HAVE messages notifying them that the client received a new block. A BT client consumed 8.67MB for exchanging bitfields while an NC client spent 4.54MB, on average. As a result, NC introduced smaller message overhead than BT with affordable encoding overhead. We also considered the decoding process and its overhead. The complexity of the decoding process is O(m^2 n), where m is the number of blocks and n is the number of elements in an information vector. Here we use O() notation to hide constants. For a 128MB file with 1024 blocks (i.e., 128KB block size), it took 280 seconds to decode the file. To reduce the decoding time, we used multiple generations, because the use of g generations reduces the decoding complexity to O(m^2 n/g). When we used 4 generations, it took 35 seconds to decode a 32MB generation and 140 seconds to decode the entire file. Furthermore, once a node downloaded one generation, it could start decoding that generation while receiving blocks in other generations. Thus, the node could finish decoding 35 seconds after downloading the last generation.
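The generation arithmetic above follows directly from the O(m^2 n/g) complexity. The sketch below is only a cost model (counting field operations, with constants and I/O omitted), so it predicts a clean 1/g speedup; the measured times reported above improve by less than that factor because of the overheads the model ignores. The parameter values are the 128MB/1024-block setup used in the experiments.

```python
def decode_ops(m, n, g=1):
    """Field operations to decode m blocks of n elements using g generations.

    Each generation holds m/g blocks, and Gaussian elimination on one
    generation costs about (m/g)^2 * n operations; decoding all g
    generations therefore costs g * (m/g)^2 * n = m^2 * n / g.
    """
    per_gen = (m // g) ** 2 * n
    return g * per_gen

m, n = 1024, 2048                 # e.g. 128 MB file in 128 KB blocks
full = decode_ops(m, n)           # one generation
split = decode_ops(m, n, g=4)     # four generations
assert split * 4 == full          # asymptotic 1/g reduction
```

The model also explains why progressive decoding helps: the final cost a node cannot overlap with downloading is just one generation's (m/g)^2 * n term, not the whole m^2 n/g.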
Using multiple generations might introduce a generation scheduling problem. However, the use of a small number of generations did not significantly increase the downloading time. For a 128MB file distribution, the delay due to multiple generations was less than 10 seconds when we used i-Code with 4 generations. Determining the numbers of blocks and generations for a file is not trivial. They largely affect performance and overhead for message exchanging, encoding, and decoding. Table 3.1 shows the parameters, which we set up empirically. BT trivially has only one generation, and the number of blocks is given for each file size. A pair in the NC row represents (the number of generations, the number of blocks) in a file. These parameters do not introduce much overhead or significantly delayed downloading times.

File size (MB)   32        128       256       512       1024      2048
BT               512       2048      2048      2048      4096      4096
NC               (1,256)   (4,1024)  (4,1024)  (8,1024)  (8,1024)  (8,1024)

Table 3.1: Parameters for generations and blocks

Summary of the practicality check

Network coding could provide near-optimal block scheduling. Even in a flash crowd without seeding, the near-optimal scheduling enabled peers to download the content almost at the ideal finish time. In addition, network coding was robust in environments where BT suffered from performance degradation. On PlanetLab, where nodes showed more dynamic and unexpected behavior, NC provided more reliable performance in the sense that the difference between the downloading times of BT and NC was larger in such hostile environments. When BT peers had a small number of neighbors, they had limited information about which blocks other nodes had. This led to inefficient block scheduling and thus slower downloading. However, NC still showed better performance even when peers had a small number of neighbors, by virtue of near-optimal scheduling. Similarly, NC provided a stable and shorter downloading time with varying block sizes, while BT's downloading time was relatively longer and largely affected by the block size. In heterogeneous environments, NC nodes with a fast link speed did not experience delayed downloads compared to BT. Due to the use of i-Code, encoding time could be reduced to less than 1% of the downloading time without sacrificing performance. Message overhead for delivering coefficient vectors was also less than 1% of the content size. The overhead for exchanging bitfield-like information was 3.5% of the content size, which was half of the overhead of BT. Therefore, NC had less message overhead than BT in total. Furthermore, a node could largely hide the time delay for decoding by progressively decoding received generations. In sum, network coding could provide improved performance and robustness with affordable overhead.

3.4 Related work

P2P applications for content distribution have recently gained popularity. Despite their success and popularity, they still suffer from inefficiency and reliability problems. To improve distribution speed and resolve the data availability problem, many state-of-the-art P2P content distribution systems such as BitTorrent [13] use the local-rarest-first or random policy. Additionally, a number of P2P systems have proposed the use of source coding with erasure codes to efficiently transfer bulk data [46, 47]. However, these two types of approaches still suffer from those problems [15]. Network coding is a new type of data transmission technique which allows any node to encode data. While network coding was originally proposed to improve network throughput in a given topology [48], there is now a large body of prior work on applying network coding to cooperative content distribution. Many studies have explored the usefulness of network coding for P2P content distribution. Some of these studies are largely based on simulation or theoretical analyses and therefore do not reflect real network conditions. Gkantsidis et al. [15] proposed a network coding-based P2P content distribution system with simulation results. They compared the performance of non-coding, source coding, and network coding approaches. Motivated by [15, 22], Yeung [42] studied the use of full random linear network coding in P2P networks. He modeled such a network by a time-expanded trellis network where a physical node x at time t is represented as a node (x, t). Under this representation, a network-coded P2P network becomes an acyclic directed graph, and full random linear network coding achieves the maximum possible throughput. Yang et al. studied P2P file sharing based on network coding [20]. While many other approaches use random linear network coding with ad-hoc network topologies, their system uses deterministic network coding with a highly structured topology.
This provides better performance, but topology management is more difficult in practice than using random coding on an unstructured topology. There has also been real-world implementation of network coding for content distribution. Based on [15], Gkantsidis et al. implemented a prototype system and tested it for distributing large files [22]. While this work demonstrated the feasibility of using network coding in P2P systems, the authors did not compare the performance of their system with previous P2P content distribution systems such as BitTorrent. By comparing the performance of a non-coding system and a network coding-enabled system, Wang and Li [23] explored the practicality of network coding in P2P systems. They modified an existing P2P design to use sparse random linear network coding, and used a cluster of high-performance servers with an emulated bandwidth restriction to perform experiments. However, too few nodes and small file sizes were used in their experiments, which may not reflect the conditions of today's content distribution applications where large file sizes are common. Network coding was originally formulated such that all pieces in a peer's local store were combined to produce an encoded block [25]. However, this approach is not practical due to its encoding overhead. To reduce the encoding overhead, peers can use fewer input blocks to generate a coded block [27, 18, 22, 19]. These schemes have been implemented and tested. Xu et al. proposed Swifter [19], which divides a file into fixed-size generations, uses network coding at the generation level, and distributes generations based on the local-rarest-first algorithm. Ma et al. [17] designed a P2P system based on sparse linear network coding, and implemented the system and measured its performance. Ma et al. [18] also improved the previous version by adding pre-checking of linear dependency.
They measured the performance of their system in a PlanetLab experiment, and compared it with their implementation of a BitTorrent-like P2P system. Their experiments found that their scheme outperforms a BitTorrent-like system both in efficiency and robustness. In comparison with their work, we used a more efficient network coding scheme, and showed the performance of our scheme in PlanetLab experiments with more nodes and larger file sizes. Our work started with the question: How much can real-world applications benefit from network coding? Thus, we implemented i-Code in a real-world BitTorrent client. Similarly, Bickson et al. developed the BitCod system, a network coding-based modification of BitTorrent [21]. They used a very small binary field GF(2) for encoding, and their design requires the receiver to send the complete description of its data to the sender in order to avoid dependent blocks. Their simulations showed that the performance of BitCod is comparable to BitTorrent, while other versions using other heuristics perform much worse than BitTorrent. This approach incurs message overhead from sending the complete description of the receiver's data to the sender. In contrast, with real-world experiments we show that our system provides better performance and robustness than BitTorrent without much computational or message overhead.

3.5 Summary

Network coding is an emerging solution to problems in P2P content distribution. By allowing each peer in a network to encode data, network coding can simplify block scheduling. With network coding, a peer simply asks another node to send coded data, without deciding which specific data piece to upload or download. It also eliminates the requirement that each piece be downloaded individually to complete downloading. Therefore, network coding can potentially provide better robustness and reliability for content distribution. Despite these benefits, network coding has not been widely used in the real world. There has been some doubt about the performance gains from network coding in practice, and network coding has also been blamed for its computational complexity and excessive resource use. In this chapter, we studied the practicality of network coding. We first sought a practical encoding scheme. Depending on the number of blocks combined for encoding, there are tradeoffs between encoding overhead and linear dependency. An encoding scheme which combines many blocks demands more encoding overhead, although it produces fewer dependent blocks. On the other hand, a sparse coding with a small number of combined blocks generates many dependent blocks, although it has smaller encoding overhead. No existing encoding scheme satisfies both low encoding overhead and a low level of block dependency. The primary contribution of our work is the design of i-Code, which is lightweight and efficient. i-Code combines only two blocks for every encoding operation, dramatically reducing encoding overhead. However, it does not suffer the dependent-block penalty faced by encoding schemes which combine few input blocks. The key idea is to emulate an encoding scheme which combines many input blocks.
To achieve that, each peer using i-Code maintains a "well-mixed" block, which we call the accumulation block, so that all blocks the peer has are "accumulated." Therefore, mixing any block with the accumulation block has an effect similar to combining many blocks. In this way, i-Code has both low encoding overhead and a low level of block dependency. To measure the performance and overhead in a real-world application, we implemented i-Code in BitTorrent clients. With these clients, we provided a thorough empirical comparison between a non-coding system (BT) and a network coding-enabled system (NC). Network coding could provide near-optimal block scheduling and reduced the average downloading time. Even in a flash crowd without seeding, the near-optimal scheduling enabled peers to download the content almost at the ideal finish time. In addition, network coding was robust in situations where BT experienced performance degradation. NC still provided reliable performance even when nodes behaved dynamically and unexpectedly. Also, the performance of NC was not largely affected by topologies, block sizes, or heterogeneous capacities, and NC exhibited a shorter downloading time than BT. Network coding introduced overhead for encoding, but with i-Code this overhead could be made negligible without sacrificing performance. The additional message overhead for delivering encoding vectors was negligible, and the total message overhead of NC was less than that of BT. Furthermore, progressive decoding could hide the time delay for decoding. Therefore, we concluded that network coding can provide improved performance and robustness with affordable overhead.

Chapter 4

Secure Network Coding

Network coding is a promising data transmission technique in which each node not only forwards but also encodes data. Network coding can provide improved performance and robustness, as shown in Chapter 3. However, in order to put network coding into practical use, there must be a way to protect it from pollution attacks, in which an attacker injects corrupted packets. Some homomorphic hash functions and homomorphic signature schemes have been designed to deal with this problem, but many such schemes lack features desirable for cooperative content distribution. This chapter provides a study of a new homomorphic signature scheme for protecting against pollution attacks. Section 4.1 briefly introduces attacks on network coding and existing countermeasures. Section 4.2 provides our new homomorphic signature scheme and its security analysis. Section 4.3 discusses practical issues in improving the performance of signature verification. Section 4.4 presents measurement results on the performance of our secure network coding implemented in a real-world P2P content distribution system. Finally, Section 4.5 summarizes the chapter.

4.1 Preliminaries

Network coding is a data transmission technique with potentially broad applications, including traditional computer networks, wireless ad-hoc networks, and peer-to-peer systems. We have already shown that network coding can provide improved performance and robustness in cooperative content distribution. However, network coding

poses security vulnerabilities by allowing untrusted nodes to produce new encoded data. In this section, we describe attacks against content distribution based on network coding and their countermeasures.

4.1.1 Threat Model

Our content distribution model is based on general P2P distribution systems such as BitTorrent, presented in Chapter 3. A single source which owns content F shares the content, and peers interested in the content cooperatively participate in its distribution. F is divided into m pieces, and the source encodes the pieces into the form of augmented blocks when it sends out content to other nodes. Peers which have downloaded blocks from other nodes, including the source, linearly combine the blocks they have and upload the resulting blocks to other nodes. When peers receive at least m linearly independent blocks, they can reconstruct the original content. See Section 3.1 for the details. P2P architectures inherently suffer from a number of attacks, including disrupting the topology, providing false information, dropping messages, and introducing other nodes to malicious neighbors. In this study we do not consider such attacks. Instead, we focus on attacks which arise from the use of network coding for content distribution. In our model, a content source generates content information (i.e., metadata) and distributes it so that other nodes can participate in the distribution session by reading it. We assume that peers trust this metadata. Malicious attackers might distribute false content information, but this issue is commonly resolved by using hash functions in real-world applications [13, 44]. Peers can select the desired metadata file by comparing the hash value of the received file with the hash for the metadata file advertised by the source. We do not deal with how peers get the right hash value in this study; it is possible to obtain the hash value from recommendations, web sites, or advertisements. However, we assume nodes in a distribution session are not necessarily trustworthy.
Some malicious nodes want to disrupt the distribution of the content by propagating corrupted blocks, so that legitimate peers cannot decode the original content, or to reduce the rate of content dissemination. Under these assumptions, we introduce pollution attacks. Network coding is significantly vulnerable to pollution attacks. A malicious node can generate corrupted packets and then distribute them to other nodes. Observe that the commonly used method of protecting the integrity of each block with hash values does not work with network coding, since each peer produces unique encoded blocks for which hash values cannot be provided in the metadata file. Even worse, a corrupted block can in turn be used to (unintentionally) create new encoded packets that are also corrupted. It may be obvious that an incorrect file has been decoded, but the bandwidth, storage, and computing time wasted on the invalid file cannot be recovered. Since blocks in the network are combinations of original file blocks, none of them is immediately verifiable as having correct content. Since it is not obvious which block is corrupted, the node must re-try downloading the entire file. Therefore, networks with untrusted members cannot use vanilla network coding (since it is too easy to compromise the availability of any and all files), but must implement a security protocol to verify coded data before passing it on to other nodes or reconstructing the file.
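The core difficulty, that a hash list over the original blocks cannot vouch for freshly coded blocks, can be seen in a few lines. This is an illustrative sketch with made-up toy blocks and a toy field modulus; real systems use much larger fields and blocks.

```python
import hashlib

p = 251                          # toy field modulus (illustrative only)

def h(block):
    # Conventional per-block integrity check: hash of the raw block.
    return hashlib.sha256(bytes(block)).hexdigest()

b1 = [3, 1, 4, 1, 5]             # original blocks held by the source
b2 = [2, 7, 1, 8, 2]
hash_list = {h(b1), h(b2)}       # published in the metadata file

# A peer forwards a fresh linear combination, as network coding requires.
coded = [(5 * x + 9 * y) % p for x, y in zip(b1, b2)]

# The coded block matches no published hash, so a recipient cannot tell
# a legitimate combination from a polluted block using the hash list.
assert h(coded) not in hash_list
```

This is exactly why the verification information itself must be homomorphic: it has to follow the data through arbitrary recombination.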

4.1.2 Related work

In P2P content distribution applications such as BitTorrent [13], hash schemes are commonly used to verify the integrity of data. The source of a file computes hash values using a hash function and records them in a metadata file. A peer which is interested in the file gets the metadata file before downloading the file and obtains the hash values from it. When the peer receives a block, it computes a value for the received block using the verification function (i.e., the hash function the source used). It then compares the computed value with the value in the metadata file. Signature schemes may be used in a similar way, but they are not commonly used. Unfortunately, these traditional solutions are not applicable to network coding, since each peer produces unique encoded blocks for which hash values cannot be provided in the metadata file. Therefore, several schemes have been proposed for securing network coding against the pollution attack. Most of the cryptographic solutions are network coding signature schemes. Since intermediate nodes which do not have the source's secret key can encode 'new' blocks to be signed, traditional signature schemes cannot be used to protect against the pollution attack. A network coding signature scheme, however, allows a node to check whether a packet is a valid encoded vector in linear network coding; that is, whether it is a correct linear combination of the original packets of the file being transmitted. We here define the homomorphism of network coding signature schemes. Homomorphic network coding signature schemes do not verify a block against a signature computed for the whole file. Instead, a signature is generated for each block and verified with the block, but it is possible to combine signatures without the secret key of the source. We first list two non-homomorphic signature schemes. Krohn et al.
[33] suggested using a new type of hash function to verify erasure-coded data, but it is also applicable to network coding. Using a collision-resistant homomorphic hash function h which is secure under the discrete logarithm assumption, a source publishes the hash list

H = (h(ṽ_1), . . . , h(ṽ_m))

consisting of hashes of the blocks of the file, which has to be distributed, along with the description of h (which consists of n group elements), before receiving the file. This scheme can be classified as a network coding signature scheme. Zhao et al. [34] proposed another network coding signature scheme, in which the source computes the signature based on a vector orthogonal to all of the augmented vectors ṽ_i of the file. Finding such a vector causes computational overhead, although it is not heavy. Also, the public key information cannot be reused across multiple files. Homomorphic signature schemes can be characterized by their assumptions, operational fields, and signature aggregation. Charles et al. [31] constructed a homomorphic signature scheme based on the aggregate signature of Boneh et al. [49]. It is based on bilinear pairing and the computational Diffie-Hellman assumption. Boneh et al. [32] formalized the notions of the network coding signature and the homomorphic network coding signature, and proposed two schemes. One is essentially an optimized adaptation of the homomorphic hashing scheme of Krohn et al. (it can also be considered as related to the scheme of Zhao et al.), and the other is a homomorphic signature scheme defined on bilinear groups, similar to the scheme of Charles et al. [31] but with reduced public key size. Yu et al. [50] proposed a homomorphic signature scheme based on the RSA assumption, but because they use two different mod operations with two different moduli, it is easy to see that their scheme fails to be homomorphic. Gennaro et al. [36] proposed network coding signature schemes based on integer arithmetic instead of finite fields. Their homomorphic signature scheme can be considered a corrected version of Yu et al. [50]. Their scheme needs integer arithmetic because they work in a hidden-order group. Some have studied symmetric-key cryptographic solutions instead of signatures.
Generally these schemes are much more efficient than network coding signature schemes, but key management is more difficult; keys must be distributed to each recipient separately. Gkantsidis and Rodriguez [30] proposed using secret random checksums to verify received packets. A secret random mask α = (α_1, . . . , α_n) ∈ F_p^n is chosen for each node, and for each block ṽ_i = (v_{i,1}, . . . , v_{i,n}) ∈ F_p^n of a file, the checksum is defined as f_α(ṽ_i) = Σ_{j=1}^{n} α_j v_{i,j}. Each user receives (α, f_α(ṽ_1), . . . , f_α(ṽ_m)) through a secure channel, and verification of a packet is done using the checksum. While very efficient, this scheme burdens the source with distributing the secret keys individually to each node, and essentially requires a trusted central server. Agrawal and Boneh [35] formulated the notion of the homomorphic MAC, which may be considered the symmetric-key version of the homomorphic signature scheme. The homomorphic MAC itself is susceptible to pollution attacks by insiders, but starting from a homomorphic MAC, they construct a broadcast MAC which mitigates the insider attack by protecting against collusion of up to c nodes.
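The secret-checksum idea above is small enough to sketch directly. The toy modulus, block length, and coefficients below are illustrative values, not parameters from [30]; the point is the linearity of f_α, which is what lets a node verify arbitrary linear combinations from the per-node secrets it received over the secure channel.

```python
import random

p = 251                                  # toy field modulus (illustrative)
n = 4                                    # elements per block (illustrative)

def checksum(alpha, v):
    # f_alpha(v) = sum_j alpha_j * v_j  (mod p)
    return sum(a * x for a, x in zip(alpha, v)) % p

random.seed(1)
alpha = [random.randrange(p) for _ in range(n)]    # this node's secret mask
v1, v2 = [1, 0, 5, 7], [0, 1, 2, 4]
c1, c2 = checksum(alpha, v1), checksum(alpha, v2)  # received via secure channel

# Linearity: the checksum of any coded block follows from the received
# checksums, so the node can verify combinations it downloads.
coded = [(3 * a + 8 * b) % p for a, b in zip(v1, v2)]
assert checksum(alpha, coded) == (3 * c1 + 8 * c2) % p
```

Note that α must stay secret per node: anyone who learns it can forge blocks that pass this node's check, which is exactly the key-distribution burden criticized below.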

4.1.3 Requirements for content distribution

This subsection outlines the requirements for solutions that protect network coding-enabled systems from the pollution attack. We of course consider basic requirements such as low computational and message overhead, but we focus here on desirable features which are specifically applicable to securing network coding for P2P content distribution. Solutions for protecting network coding from the pollution attack use information such as keys, hashes, signatures, or checksums. We simply call this information key information. Fundamentally, schemes for securing network coding should not require much computation for generating, aggregating, and verifying the key information. Charles' and Boneh's schemes [31, 32] have relatively higher computational overhead than other schemes because of pairing operations and the cost of signature generation and aggregation. (A detailed analysis of computational aspects is provided in Section 4.2.2.) Secure network coding schemes may allow the generation of key information before content is prepared, which is referred to as online-construction in this chapter. This gives a system the freedom to distribute the key information in advance. Furthermore, such a solution can be utilized in various applications such as streaming services, where new content arrives continuously. All homomorphic signature schemes for network coding provide the online-construction feature. On the other hand, Krohn's, Gkantsidis', and Zhao's schemes do not allow online construction [33, 30, 34].
It might be desirable to use the same key information for multiple files. Such reuse of key information definitely reduces the overhead of generating and distributing the information. Charles' and Boneh's schemes [31, 32] can reuse public keys. However, most of today's P2P applications for content distribution do not use such information multiple times. In P2P applications, users are not likely to have key information for the specific user who owns content. Instead, users obtain metadata files and extract the information when they need it. Furthermore, the size of such metadata files is commonly less than 1 MB. Therefore, we do not consider the reuse of key information in this work. In P2P content distribution, secure network coding solutions should not rely on secure channels or symmetric secrets. For example, Gkantsidis' scheme [30] requires secure channels to distribute checksums to each user. This scheme assumes that a server (or the source of content) always stays in a distribution session and gives information to each node when it joins the session. However, this assumption is not realistic for many applications in P2P environments. The source may leave the session or suffer a failure; then newly joining nodes cannot get secure checksums and cannot verify the data they receive. Moreover, nodes newly joining the session must contact the source to obtain the checksums. This limits scalability and cannot be used in large-scale P2P systems. Symmetric-key cryptographic solutions are also undesirable: generally these schemes are much more efficient than network coding signature schemes, but key management is more difficult, especially in P2P networks. We also point out that solutions should not expand the size of encoded blocks. The scheme in [36] uses integer arithmetic instead of finite field arithmetic, which makes the size of each block variable. This is not appropriate for many applications and may become inefficient when the number of hops in the network becomes large. In P2P systems such as BitTorrent, file blocks are continuously redistributed by multiple users over time, and it is difficult to limit the number of times a block is transferred. In such applications, we do not know how many nodes the blocks traverse (i.e., the hop count) and how severe data expansion will eventually become.
We have shown that existing solutions for securing network coding lack desirable features or cannot be applied to P2P systems for content distribution. In the next section, we therefore propose a new homomorphic signature scheme satisfying the requirements we discussed.

4.2 Secure Network Coding

In this section, we describe our homomorphic signature scheme, which we refer to as the KYCK signature. This scheme has low-cost signature generation and aggregation. It also provides online-construction and does not require secure channels. We then analyze the security of our scheme and compare it with other schemes, focusing on computational aspects.

4.2.1 Signature Scheme

Let p be a prime number such that F_p is the base field of the network coding, and let G be a group of order p in which the discrete logarithm problem is hard. Select g as a generator of G. Specifically, we choose two primes p, q with p | q − 1. Then G is defined as the subgroup of order p of Z_q^*. If |q| ≥ 1024 and |p| ≥ 160, then the currently conjectured difficulty of the discrete logarithm problem on G is about 2^80. Let s_1, . . . , s_{m+n} ∈ Z_p be randomly chosen secret exponents, and let y_i = g^{s_i}.

Then the public key is defined as PK = (p, q, g, y_1, . . . , y_{m+n}), and the secret key is SK = (s_1, . . . , s_{m+n}).

For a block w = (w_1, . . . , w_m, w_{m+1}, . . . , w_{m+n}), the signature σ for w is defined as

σ = s_1 w_1 + · · · + s_{m+n} w_{m+n} (mod p).

Signature verification is done by checking

g^σ =? y_1^{w_1} · · · y_{m+n}^{w_{m+n}} (mod q).

This signature is homomorphic: if w_1, . . . , w_k are input blocks, and σ_1, . . . , σ_k are the corresponding valid signatures received along with the w_i, then for any output block w = α_1 w_1 + · · · + α_k w_k, the corresponding signature is σ = α_1 σ_1 + · · · + α_k σ_k. We can prove that this scheme is secure under the assumption that the discrete logarithm problem is hard. The proof is provided in Appendix A.

4.2.2 Comparisons

We here compare our KYCK signature with the schemes from [33, 34, 32]. We choose these schemes because they are closely associated with our work or are relatively more appropriate for P2P content distribution than the others discussed in Section 4.1.

Table 4.1: Comparison of secure network coding schemes

                     Krohn et al.        Zhao et al.         Boneh et al.     KYCK
Type                 hash                sign.               sign.            sign.
Group                G ⊆ Z_q^*, |G| = p  G ⊆ Z_q^*, |G| = p  bilinear group   G ⊆ Z_q^*, |G| = p
Public key size      constant            m + n               constant         m + n
Signature size       m                   m + n               1 per block      1 per block
Initial cost         mn exp.             N/A                 m + n exp.       m + n exp. and mn mult.
Signature cost       N/A                 N/A                 m + n exp.       m + n mult.
Aggregation cost     N/A                 N/A                 l exp.           l mult.
Verification cost    m + n exp.          m + n exp.          m + n exp.       m + n + 1 exp.
Key pair reuse       no                  no                  yes              no

Table 4.1 provides a comparison between our scheme and those of Krohn et al., Zhao et al., and Boneh et al. Our scheme presents excellent characteristics, especially in the initial, signature, and aggregation costs. In the case of the homomorphic hash function of Krohn et al., the source has to compute mn exponentiations to generate the full hash list to be initially distributed to the receivers. In comparison, our scheme has an initial cost of only m + n exponentiations, about the same as verifying a single block, and only about 1/m of that of Krohn et al. Compared with Boneh et al., our scheme has negligible signature generation and aggregation costs. The source's computation amounts to approximately the verification of a single block, whereas in the scheme of Boneh et al. the signature generation cost is comparable to the verification cost. Also, our scheme does not use the costly bilinear pairing operation. Our total cryptographic cost is actually very similar to that of the scheme of Zhao et al. – the total cost for signature generation is essentially the same. However, since ours is a homomorphic signature scheme, the cost is divided among blocks, and the initially-distributed data is much more compact. Only the signature schemes of Boneh et al. are explicitly designed to be able to reuse key pairs for multiple files. This is a desirable property, but in the other schemes, including ours, with appropriate parameter settings for P2P applications, the size of the initial data (including the public key) is typically less than 1% of the file size. For many applications where the size of the public key is negligibly small compared to the actual file, we may tolerate the inability to reuse key pairs. Also, if the application requires anonymity or repudiation, public-key reuse may not be a desirable property.
In short, our scheme compares very favorably with other cryptographic authentication solutions for network coding, and it is usable in any application where the public key size is negligible in comparison with the file size, which includes typical P2P content distribution systems.

4.3 Practical Considerations

Most secure network coding schemes, including ours, pay a significant computational cost for their security. In our scheme, the cost of verifying block signatures is dominant. In this section, we describe the choices we made to optimize verification performance.

4.3.1 Parameter Setup

We use the same notation as before. A file of size F consists of m blocks, and each block has n elements (i.e., dimensions) in a field, where each element is represented by l bits. In our scheme, verifying one block takes m + n + 1 exponentiations. Before decoding, a node needs to verify at least m blocks, so the total verification cost is dominated by mn exponentiations since m ≪ n. In each peer, signature verification should be done after receiving each block, and this computation can be performed while downloading another block. We are thus interested in the cost of a single verification, which is dominated by n exponentiations. Therefore, to decrease the verification cost, the smaller n is, the better. In general, m is determined first, because the value of m largely affects downloading speed and decoding time. The block size is thus commonly fixed at nl bits, and n can be tuned by changing the element size l. This means that it is generally better to take the bit length l as large as possible.

We use the GNU MP Bignum Library [51] for our implementation, and we discovered that the multiplication cost is almost linear up to l ≤ 512, not quadratic in l. The reason is that, since arithmetic is done at the word level rather than the bit level, there is a basic granularity at the word level (32 or 64 bits, depending on the compilation), and GMP also uses more sophisticated multiplication algorithms with better asymptotic complexity than O(l²). For l larger than 512, the quadratic behavior kicks in and the cost deviates significantly from linear. For this reason we use l = 512 in our implementation.
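As a concrete illustration of this parameter setup, the following sketch computes n and the per-block verification cost from example values of F, m, and l (the numbers are hypothetical, chosen to match the 128MB, 1024-block experiments later in this chapter):

```python
# Hypothetical parameter setup for the scheme of Section 4.3.1.
F = 128 * 1024 * 1024      # file size in bytes (example value)
m = 1024                   # number of blocks (chosen first: affects decoding time)
block_bytes = F // m       # 128 KB per block
l = 512                    # bits per field element (largest size where GMP
                           # multiplication still scales almost linearly)
n = block_bytes * 8 // l   # elements per block: the block size nl is fixed,
                           # so a larger l gives a smaller n
exps_per_block = m + n + 1 # exponentiations to verify one block

print(n, exps_per_block)   # 2048 3073
```

Halving l to 256 would double n to 4096, which is why the largest l with near-linear multiplication cost is preferred.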

4.3.2 Performance Boost

Verification with Multi-exponentiation

If the time to verify one block is less than or equal to the time to receive one, then we may hide the cost of signature verification by pipelining. However, when the verification time is longer, we may use multi-exponentiation to speed up signature verification. Multi-exponentiation evaluates a formula of the form ∏_i g_i^{e_i} by processing many bits of the exponents together. Typically, multi-exponentiation consists of two stages: in the precomputation stage, basis elements g_i are combined up to the chosen window size w, and in the evaluation stage, the precomputed data is used to process exponents together up to bit length w, reducing the number of multiplications at the cost of memory for precomputation. For details, see [52]. For our signature scheme, we implemented the ‘Basic Interleaving Exponentiation Method’ of [52].
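The idea can be sketched in simplified form as a fixed-window interleaving multi-exponentiation (a simplification of the sliding-window interleaving method of [52]; the function name and window default are our own):

```python
def multi_exp(bases, exps, mod, w=4):
    """Compute prod(g**e for g, e in zip(bases, exps)) % mod using one
    shared squaring chain and w-bit fixed windows per base."""
    # Precomputation stage: table[d] = g**d % mod for d = 0 .. 2**w - 1.
    tables = []
    for g in bases:
        tbl = [1]
        for _ in range((1 << w) - 1):
            tbl.append(tbl[-1] * g % mod)
        tables.append(tbl)
    # Evaluation stage: scan all exponents top-down, w bits at a time,
    # so the squarings are shared among all bases.
    nbits = max((e.bit_length() for e in exps), default=0)
    acc = 1
    for j in reversed(range((nbits + w - 1) // w)):
        for _ in range(w):
            acc = acc * acc % mod
        for tbl, e in zip(tables, exps):
            d = (e >> (j * w)) & ((1 << w) - 1)
            if d:
                acc = acc * tbl[d] % mod
    return acc

p = (1 << 61) - 1  # a Mersenne prime, used here only as a toy modulus
assert multi_exp([3, 5, 7], [123456, 654321, 42], p) == \
    pow(3, 123456, p) * pow(5, 654321, p) * pow(7, 42, p) % p
```

The point of interleaving is that one squaring chain of roughly max bit length serves all bases at once, instead of one chain per exponentiation.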

Batch Verification of Signatures

In order to reduce the cost of verification, we consider batch verification. With this approach, k received blocks are verified together instead of verifying each block separately. In our batch verification, we generate a block by linearly combining k blocks and verify the signature of the aggregated block. Because this combination requires only multiplications and additions, the cost of batch verification is dominated by the time to verify the single aggregated block, which requires exponentiations. Thus, the time for a batch verification is roughly 1/k of the time to verify k block signatures separately. To use batch verification for performance boosting, we must show that it does not sacrifice security; the proof is provided in Appendix A. Therefore, without sacrificing security, we can verify multiple signatures at almost constant cost: only the cost of verifying one block is needed for any k signatures verified in a batch.
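To make the mechanics concrete, here is a toy discrete-log-style homomorphic signature with batch verification over a Mersenne prime (the modulus, dimension, and all names are illustrative only; the real scheme's group and parameters differ):

```python
import random

random.seed(1)
p = (1 << 127) - 1           # toy prime modulus; exponents live mod p - 1
g = 3
dim = 5                      # n + m coordinates in this toy example
s = [random.randrange(1, p - 1) for _ in range(dim)]  # signer secrets
y = [pow(g, si, p) for si in s]                       # public key

def sign(w):
    # Homomorphic signature: sigma = sum_i s_i * w_i (mod p - 1).
    return sum(si * wi for si, wi in zip(s, w)) % (p - 1)

def verify(w, sigma):
    # Check g^sigma == prod_i y_i^{w_i} (mod p): one exponentiation-heavy step.
    rhs = 1
    for yi, wi in zip(y, w):
        rhs = rhs * pow(yi, wi, p) % p
    return pow(g, sigma, p) == rhs

def batch_verify(blocks, sigmas):
    # Combine k blocks with random coefficients, then verify only once.
    r = [random.randrange(1, p - 1) for _ in blocks]
    w = [sum(rj * b[i] for rj, b in zip(r, blocks)) % (p - 1)
         for i in range(dim)]
    sigma = sum(rj * sj for rj, sj in zip(r, sigmas)) % (p - 1)
    return verify(w, sigma)

blocks = [[random.randrange(p - 1) for _ in range(dim)] for _ in range(3)]
sigmas = [sign(b) for b in blocks]
assert all(verify(b, sg) for b, sg in zip(blocks, sigmas))
assert batch_verify(blocks, sigmas)        # k blocks, one expensive check
blocks[0][0] ^= 1                          # corrupt one coordinate
assert not batch_verify(blocks, sigmas)    # the random combination catches it
```

Combining the k blocks costs only multiplications and additions; a corrupted batch passes only with negligible probability, matching the claim proven in Appendix A.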

4.4 Evaluation

In order to evaluate our secure network coding scheme, we provide measurement results on its performance and overhead in a real-world system. We implemented the scheme and incorporated it into our practical network coding client iTorrent, a network coding-enabled BitTorrent client modified from CTorrent [26]. We conducted experiments in our local testbed, which consists of 30 machines equipped with 1.0GHz dual-core AMD Opteron 1218 processors with 1.0MB of cache.

We mainly present performance in terms of the time to sign and verify a block. In our scheme, signature generation was almost negligible; in general, it took only a few milliseconds to sign a block of several megabytes, because signature generation requires only multiplications and additions. Signature verification was more costly, and Figure 4.1 shows the signature verification time for various block sizes. The verification time increased almost linearly with the block size. Without the multi-exponentiation approach discussed in Section 4.3, the verification speed was about 180KB/s, as shown in the “W=0” line. For example, it took 1.42 seconds to verify a 256KB block.

How fast should the verification process be? It is, of course, desirable that the verification time be as small as possible. In order not to delay the distribution process, verification needs to be faster than the downloading of blocks. Suppose peer A has an unverified block w1 and is receiving block w2. In our system, we implemented two threads: one is responsible for verifying block signatures, and the other handles all operations except verification. Like “pipelining,” peer A is thus able to verify the signature of w1 while simultaneously receiving w2. If the time to verify a block is shorter than the time to receive a block, A can finish verifying w1 before it receives w2. Otherwise, w2 must wait until w1 is verified.
In that case, unverified blocks accumulate gradually, delaying the downloading process. Clearly, this verification speed may degrade performance when a peer receives blocks faster than 180KB/s. We therefore considered methods to boost the performance of our secure network coding.
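The two-thread pipeline can be sketched as follows (a minimal stand-in: a hash comparison plays the role of KYCK signature verification, and all names are our own):

```python
import hashlib
import queue
import threading

blocks = [(i, b"block-%d" % i) for i in range(8)]
expected = {i: hashlib.sha256(d).digest() for i, d in blocks}

pending = queue.Queue()
verified = []

def verify_worker():
    # Verifier thread: drains the queue while the download loop keeps going.
    while True:
        item = pending.get()
        if item is None:
            break
        idx, data = item
        if hashlib.sha256(data).digest() == expected[idx]:  # stand-in check
            verified.append(idx)

t = threading.Thread(target=verify_worker)
t.start()
for blk in blocks:      # the "download" loop never blocks on verification
    pending.put(blk)
pending.put(None)       # sentinel: no more blocks will arrive
t.join()
assert sorted(verified) == list(range(8))
```

If verification is slower than download, the queue simply grows, which is exactly the accumulation of unverified blocks described in the text.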

[Plot omitted: verification time (seconds) vs. block size (128–1024 KB), for multi-exponentiation window sizes W = 0, 2, 4, 8.]

Figure 4.1: Signature verification time

To speed up signature verification, we used the multi-exponentiation method discussed in Section 4.3.2. Figure 4.1 also compares the verification time for different window sizes. By processing exponents together up to bit length W, we could reduce the number of multiplications and thus the verification time. This also required memory for precomputation: for a linear increase in speed, the required memory increased exponentially. Typically, however, a small window size of up to W = 4 is a reasonable tradeoff, and we use 4 as the window size in the following experiments.

To further reduce the cost of verifying signatures, we also used batch verification as explained in Section 4.3.2. Instead of verifying each block individually, we collected k blocks and verified their signatures at once, at the cost of verifying a single block. To empirically test the effectiveness of batch verification, we conducted experiments in our testbed. In the experiments, we distributed a 128MB file consisting of 1024 blocks grouped into 4 generations. The source of the file stayed in the distribution session throughout. Other peers joined every 15 seconds and left the session immediately

[Plot omitted: total verification time (seconds) vs. batch size (1–10).]

Figure 4.2: Signature verification time

after they completed downloading. Each peer's upload bandwidth was limited to 80KB/s. Figure 4.2 shows the average total verification time, i.e., the total time for each peer to verify all blocks until it finished downloading the file. In the figure, the X-axis represents the batch size k used in the experiments. We observed that the total verification time decreased as the batch size increased. When the batch size was 5, the total verification time was 93 seconds, only 5.9% of the total download time (1566 seconds on average). This improvement due to batch verification is quite promising considering the affordable verification overhead.

Despite the benefit of batch verification, one might worry that batch verification would increase downloading time. Suppose that peer A is currently receiving a block wi. Without verification, immediately after downloading wi, A can use wi to encode an outgoing block. With secure network coding schemes in general, a received block can be used for encoding only after it is verified. This means there is a (small) delay before wi can be used for encoding. With batch verification, this delay becomes longer than when verifying each block individually, because verification is performed

[Plot omitted: downloading time (seconds) vs. batch size (1–10).]

Figure 4.3: Downloading time according to batch size

after collecting multiple blocks. We examined how much batch verification delays the download completion time. Figure 4.3 shows downloading times for different batch sizes k. We observed that the batch size did not largely affect downloading time unless the batch size was too large (for example, 10). Therefore, batch verification can dramatically reduce the cost of verification without much delaying the download.

With our BitTorrent clients, we provide a thorough empirical performance comparison in terms of downloading time. Figure 4.4 shows the average downloading time of each system: a system without network coding (BT), a system using network coding without the signature scheme (NC), and a system using secure network coding (SNC). We also include a “1:1” download, where a single client downloaded the content directly from a server without any interference from other nodes. We repeated the same experiment as in Figure 4.3 with the batch size set to 3. In this setting, the 1:1 download is considered the ideal downloading time. As shown in the figure, NC and SNC had almost the same downloading time. In our implementation, the security module for detecting data pollution ran in a separate thread; therefore, a peer could check the integrity of

[Plot omitted: downloading time (seconds) vs. file size (32–1024 MB), for BT, NC, SNC, and 1:1 download.]

Figure 4.4: Downloading time comparison according to file size

a block while concurrently receiving another block. Furthermore, because the time for the integrity check was shorter than the transmission time to download a block, SNC did not delay the downloading process. Therefore, SNC could provide the same level of performance as NC. We observed that the downloading times of NC and SNC were close to the 1:1 downloading time, which means they provided near-optimal scheduling, as shown in Chapter 3. SNC introduced more overhead than NC due to the cost of signature verification. However, the signature verification process did not delay the downloading process while consuming 6% of CPU cycles. In sum, SNC can protect network coding-enabled systems from pollution attacks while providing better performance than BT. Furthermore, the overhead due to signature verification can be reduced to an affordable level by performance-boosting methods such as batch verification.

4.5 Summary

Peer-to-peer content distribution systems run in inherently untrustworthy environments. The use of network coding introduces security vulnerabilities by allowing untrusted nodes to produce new encoded data. Network coding is especially vulnerable to pollution attacks, where malicious nodes inject corrupted data into a network. Because of the nature of network coding, even a single unfiltered false block may propagate widely in the network and, by being mixed with other correct blocks, disrupt correct decoding on many nodes. Since blocks are re-coded in transit, traditional hash or signature schemes do not work with network coding.

Although several schemes have been proposed for securing network coding against pollution attacks, they are not appropriate for P2P systems. Some schemes require high computational overhead or data expansion when blocks traverse many nodes. Other schemes do not support desirable functionality such as online construction, which is needed for streaming applications. The rest require secure channels for sharing key information or use symmetric keys, which are hard to use in P2P environments.

Therefore, we designed a new homomorphic signature scheme for protecting against pollution attacks. Compared to existing schemes, our scheme requires smaller computational overhead. It also comes with desirable features appropriate for P2P content distribution. We implemented this scheme in a real-world BitTorrent client and measured its performance and overhead in content distribution sessions. In our scheme, signature generation and aggregation introduce negligible overhead, and the signature verification cost is dominant. Thanks to multi-threaded programming and multi-exponentiation operations, signature verification does not delay the downloading process. Therefore, our secure network coding-enabled system provides almost the same downloading time as the network coding system without the security scheme.
Furthermore, batch verification can dramatically reduce the total signature verification time without sacrificing performance. We finally conclude that our secure scheme can protect network coding from pollution attacks with affordable overhead.

Chapter 5

Conclusion

The advent of peer-to-peer (P2P) technologies has shifted the paradigm of content distribution, making it a promising candidate for more scalable, more fault-tolerant, and faster content distribution. In this thesis, we have explored solutions to improve the efficiency, reliability, and security of large-scale P2P content distribution systems, by improving DHT lookups and making network coding practical and secure.

We evaluated the lookup performance of Kad, which is currently deployed at Internet scale. The current Kad lookups are inconsistent and waste system resources. Does node churn mean that DHTs are doomed to have poor lookup performance? We examined the interplay between application-level and routing-level semantics in Kad in order to determine the primary cause of its poor lookup performance. The poor performance comes from the otherwise "good" feature that peers have a very consistent view of their surrounding areas; in other words, parts of their routing tables are similar. However, the Kad lookup algorithm does not work well with this feature. Our solution addresses this problem in an easily deployable fashion, leading to significantly improved performance. We believe that our study can serve as a useful guide to the future deployment of large-scale DHTs.

The usefulness of network coding for content distribution has been disputed because of its questionable performance gains, coding overhead, and security vulnerabilities in practice. In this thesis, we sought to answer the question: how much does (secure) network coding benefit and cost real-world applications? To measure the performance and overhead in practice, we have modified a real-world BitTorrent client so that it uses

network coding. To make network coding more practical, we provided two solutions: i-Code, a lightweight and efficient encoding scheme, and the KYCK signature, a homomorphic digital signature that protects network coding from pollution attacks. Using both the network coding-enabled clients and non-coding clients, we conducted extensive experiments with many nodes communicating over local-area and wide-area networks. To the best of our knowledge, this thesis provides the first empirical performance comparison using a real-world application that includes the overhead of securing network coding. The measurement study shows that a network coding-enabled system can provide improved performance and robustness with affordable overhead. We hope that this promising result can motivate the research community to work on making network coding more usable. Decoding overhead and setting system parameters with theoretical analyses will be interesting topics for future work.

References

[1] Gnutella. http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html.

[2] Kazaa. http://www.kazaa.com.

[3] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content-addressable network. In Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications (SIGCOMM ’01), pages 161–172, 2001.

[4] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications (SIGCOMM ’01), pages 149–160, 2001.

[5] Antony Rowstron and Peter Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, 2001.

[6] Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John D. Kubiatowicz. Tapestry: A Resilient Global-scale Overlay for Service Deployment. IEEE Journal on Selected Areas in Communications, 22(1):41–53, 2004.

[7] Petar Maymounkov and David Mazières. Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In the 1st International Workshop on Peer-to-Peer Systems (IPTPS ’02), 2002.

[8] Azureus. http://azureus.sourceforge.net.

[9] BitTorrent, Inc. Mainline. http://www.bittorrent.com/.

[10] Moritz Steiner, Ernst W. Biersack, and Taoufik En-Najjary. Actively Monitoring Peers in KAD. In 6th International Workshop on Peer-to-Peer Systems (IPTPS ’07), 2007.

[11] Daniel Stutzbach and Reza Rejaie. Improving Lookup Performance Over a Widely-Deployed DHT. In Proceedings of the 25th IEEE International Conference on Computer Communications (INFOCOM ’06), pages 1–12, 2006.

[12] Jarret Falkner, Michael Piatek, John P. John, Arvind Krishnamurthy, and Thomas Anderson. Profiling a Million User DHT. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (IMC ’07), pages 129–134, 2007.

[13] BitTorrent, Inc. BitTorrent. http://www.bittorrent.com/.

[14] Pablo Rodriguez and Ernst W. Biersack. Dynamic parallel access to replicated content in the internet. IEEE/ACM Transactions on Networking, 10:455–465, 2002.

[15] Christos Gkantsidis and Pablo Rodriguez Rodriguez. Network coding for large scale content distribution. In Proceedings of 24th IEEE International Conference on Computer Communications (INFOCOM ’05), pages 2235–2245, 2005.

[16] Mikel Izal, Guillaume Urvoy-Keller, Ernst W. Biersack, Pascal Felber, Anwar Al Hamra, and Luis Garcés-Erice. Dissecting BitTorrent: Five months in a torrent’s lifetime. In Passive and Active Network Measurement, pages 1–11, 2004.

[17] Guanjun Ma, Yinlong Xu, Minghong Lin, and Ying Xuan. A content distribution system based on sparse linear network coding. In Proceedings of the 3rd Workshop on Network Coding, Theory, and Applications (NETCOD), 2007.

[18] Guanjun Ma, Yinlong Xu, Kaiqian Ou, and Wen Luo. How can network coding help P2P content distribution? In Proceedings of the 2009 IEEE International Conference on Communications (ICC ’09), pages 1–5. IEEE, 2009.

[19] Jinbiao Xu, Jin Zhao, Xin Wang, and Xiangyang Xue. Swifter: Chunked network coding for peer-to-peer content distribution. In Proceedings of the 2008 IEEE International Conference on Communications (ICC ’08), pages 5603–5608. IEEE, 2008.

[20] Min Yang and Yuanyuan Yang. Peer-to-peer file sharing based on network coding. In Proceedings of the 28th IEEE International Conference on Distributed Computing Systems (ICDCS ’08), pages 168–175. IEEE Computer Society, 2008.

[21] Danny Bickson and Roy Borer. The BitCod client: A BitTorrent clone using network coding. In Manfred Hauswirth, Adam Wierzbicki, Klaus Wehrle, Alberto Montresor, and Nahid Shahmehri, editors, Peer-to-Peer Computing, pages 231–232. IEEE Computer Society, 2007.

[22] Christos Gkantsidis, John Miller, and Pablo Rodriguez. Comprehensive view of a live network coding P2P system. In Proceedings of ACM SIGCOMM/USENIX Internet Measurement Conference (IMC ’06), pages 177–188, 2006.

[23] Mea Wang and Baochun Li. How practical is network coding? In Proceedings of the 14th IEEE International Workshop on Quality of Service (IWQoS), pages 274–278, 2006.

[24] Mea Wang and Baochun Li. Lava: A reality check of network coding in peer-to-peer live streaming. In Proceedings of 26th IEEE International Conference on Computer Communications (INFOCOM ’07), 2007.

[25] Shuo-Yen Robert Li, Raymond W. Yeung, and Ning Cai. Linear network coding. IEEE Transactions on Information Theory, 49(2):371–381, 2003.

[26] D. Holmes. Enhanced CTorrent. http://www.rahul.net/dholmes/ctorrent/.

[27] Philip A. Chou, Yunnan Wu, and Kamal Jain. Practical network coding. Allerton Conference on Communication, Control, and Computing, 41(1):40–49, 2003.

[28] Dah Ming Chiu, Raymond W. Yeung, Jiaqing Huang, and Bin Fan. Can network coding help in P2P networks? In Proceedings of the 4th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks, pages 1–5, 2006.

[29] Brent Chun, David Culler, Timothy Roscoe, Andy Bavier, Larry Peterson, Mike Wawrzoniak, and Mic Bowman. PlanetLab: An overlay testbed for broad-coverage services. ACM SIGCOMM Computer Communication Review, 33(3):3–12, 2003.

[30] Christos Gkantsidis and Pablo Rodriguez Rodriguez. Cooperative security for network coding file distribution. In Proceedings of the 25th IEEE International Conference on Computer Communications (INFOCOM ’06), pages 1–13, Apr. 2006.

[31] Denis Charles, Kamal Jain, and Kristin Lauter. Signatures for network coding. In Proceedings of the 40th Annual Conference on Information Sciences and Systems (CISS ’06), pages 857–863, Mar. 2006.

[32] Dan Boneh, David Freeman, Jonathan Katz, and Brent Waters. Signing a linear subspace: Signature schemes for network coding. In Stanislaw Jarecki and Gene Tsudik, editors, Public Key Cryptography, volume 5443, pages 68–87. Springer, 2009.

[33] Maxwell N. Krohn, Michael J. Freedman, and David Mazières. On-the-fly verification of rateless erasure codes for efficient content distribution. In IEEE Symposium on Security and Privacy, pages 226–240. IEEE Computer Society, 2004.

[34] Fang Zhao, Ton Kalker, Muriel Médard, and Keesook J. Han. Signatures for content distribution with network coding. In Proceedings of the 2007 IEEE International Symposium on Information Theory (ISIT 2007), pages 556–560, 2007.

[35] Shweta Agrawal and Dan Boneh. Homomorphic MACs: MAC-based integrity for network coding. In Michel Abdalla, David Pointcheval, Pierre-Alain Fouque, and Damien Vergnaud, editors, ACNS, volume 5536, pages 292–305, 2009.

[36] Rosario Gennaro, Jonathan Katz, Hugo Krawczyk, and Tal Rabin. Secure network coding over the integers. Cryptology ePrint Archive, Report 2009/569, 2009.

[37] The Pirate Bay. http://thepiratebay.org.

[38] Moritz Steiner, Taoufik En-Najjary, and Ernst W. Biersack. A Global View of Kad. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (IMC ’07), pages 117–122, New York, NY, USA, 2007. ACM.

[39] Moritz Steiner, Damiano Carra, and Ernst W. Biersack. Faster Content Access in KAD. In International Workshop on Peer-to-Peer Systems (IPTPS ’08), 2008.

[40] Michael J. Freedman, Karthik Lakshminarayanan, Sean Rhea, and Ion Stoica. Non-Transitive Connectivity and DHTs. In Proceedings of the 2nd conference on Real, Large Distributed Systems (WORLDS ’05), 2005.

[41] Hendrik Schulze and Klaus Mochalski. Internet Study 2008/2009. http://portal. ipoque.com/downloads/index/study.

[42] Raymond W. Yeung. Avalanche: A network coding analysis. Communications in Information and Systems, 7(4):353–358, 2007.

[43] Michael Piatek, Tomas Isdal, Thomas Anderson, Arvind Krishnamurthy, and Arun Venkataramani. Do incentives build robustness in BitTorrent? In Proceedings of the 4th USENIX Symposium on Networked Systems Design and Implementation, 2007.

[44] Vuze, Inc. Vuze. http://www.vuze.com/.

[45] Vuze, Inc. Torrent Piece Size. http://wiki.vuze.com/w/Torrent_Piece_Size.

[46] John W. Byers, Michael Luby, Michael Mitzenmacher, and Ashutosh Rege. A digital fountain approach to reliable distribution of bulk data. SIGCOMM Computer Communication Review, 28(4):56–67, 1998.

[47] Petar Maymounkov and David Mazières. Rateless codes and big downloads. In International Workshop on Peer-to-Peer Systems (IPTPS ’03), 2003.

[48] Rudolf Ahlswede, Ning Cai, Shuo-Yen Robert Li, and Raymond W. Yeung. Network information flow. IEEE Transactions on Information Theory, 46(4):1204–1216, July 2000.

[49] Dan Boneh, Craig Gentry, Ben Lynn, and Hovav Shacham. Aggregate and verifiably encrypted signatures from bilinear maps. In Eli Biham, editor, Advances in Cryptology - EUROCRYPT 2003, volume 2656, pages 416–432. Springer, 2003.

[50] Zhen Yu, Yawen Wei, Bhuvaneswari Ramkumar, and Yong Guan. An efficient signature-based scheme for securing network coding against pollution attacks. In Proceedings of the 27th Conference on Computer Communications (INFOCOM 2008), pages 1409–1417. IEEE, 2008.

[51] Free Software Foundation. The GNU MP Bignum Library. http://gmplib.org.

[52] Bodo Möller. Algorithms for multi-exponentiation. In Serge Vaudenay and Amr M. Youssef, editors, Selected Areas in Cryptography, volume 2259, pages 165–180. Springer, 2001.

[53] Mihir Bellare, Oded Goldreich, and Shafi Goldwasser. Incremental cryptography: The case of hashing and signing. In Yvo Desmedt, editor, CRYPTO, volume 839, pages 216–233. Springer, 1994.

[54] Jung Hee Cheon and Dong Hoon Lee. Use of sparse and/or complex exponents in batch verification of exponentiations. IEEE Transactions on Computers, 55(12):1536–1542, 2006.

Appendix A

Security Analysis

A.1 KYCK Signature

The security of our homomorphic signature scheme (the KYCK signature) can be proven under the assumption that the discrete logarithm problem is hard in the group G. The proof is an adaptation of the security proof for an incremental hash function given by Bellare et al. [53].

We assume that an attacker of the homomorphic signature may obtain a signature of any vector of the subspace ⟨u_1, . . . , u_m⟩ ⊆ F_p^{m+n} spanned by the augmented vectors u_1, . . . , u_m of the file which the source wants to send via the network. Since the signature scheme is homomorphic, this is equivalent to the ability to obtain the signature σ_i of each augmented vector u_i. Further, we even allow the attacker to choose the file itself, but before the source selects the key pair for the file. The goal of the attacker is to forge (w, σ) where w ∉ ⟨u_1, . . . , u_m⟩ but σ passes verification with respect to w. Let A be such an attacker, and suppose that A halts within time t with success probability at least ε.

Using A, we construct a discrete logarithm solver B: given (g, y), where g is a generator of G and y = g^x for some randomly chosen x ←$ Z_p, B uses A to find the exponent x. In order to do that, B simulates the whole network for A. B operates as follows: first, for i = 1, . . . , n, B randomly picks r_i ←$ {0, 1} and s'_i ←$ Z_p, and defines y_{m+i} as

y_{m+i} := y^{r_i} g^{s'_i}.


When A selects the file (ũ_1, . . . , ũ_m), where ũ_i = (u_{i,1}, . . . , u_{i,n}), B picks a random signature σ_i ←$ Z_p for the augmented vector u_i = (0, . . . , 0, 1, 0, . . . , 0, u_{i,1}, . . . , u_{i,n}), with the 1 in the i-th position, and defines y_i by

y_i := g^{σ_i} y_{m+1}^{−u_{i,1}} ··· y_{m+n}^{−u_{i,n}},

for i = 1, . . . , m. B then gives A the public key PK := (G, g, y_1, . . . , y_{m+n}), and also

σ_1, . . . , σ_m as the signatures corresponding to the augmented vectors u_1, . . . , u_m.

Note that so far the simulation is perfect: if s_i is the exponent satisfying y_i = g^{s_i}, then in the actual game the s_i have to be chosen randomly and independently, and y_i = g^{s_i} and σ_i = s_i + s_{m+1} u_{i,1} + ··· + s_{m+n} u_{i,n} should hold. Because s_1, . . . , s_m are chosen in an independent, uniformly random way, σ_1, . . . , σ_m should also be independent and uniformly random, which is how they are chosen by B. Also, B chose s_{m+i} as either s'_i or x + s'_i, depending on r_i (although B does not know the implicitly given x). Since the s'_i are chosen independently and uniformly at random, it follows that s_{m+1}, . . . , s_{m+n} are correctly distributed, and s_1, . . . , s_m are then determined by the σ_i and s_{m+j} (implicitly for B).

After running up to t steps, A halts and returns a forgery candidate (w*, σ*). Suppose that this is a successful forgery. Let us write w* = (α_1, . . . , α_m, w*_1, . . . , w*_n). Because this is a successful forgery, w* ∉ ⟨u_1, . . . , u_m⟩. Let us define w := α_1 u_1 + ··· + α_m u_m. Since w ∈ ⟨u_1, . . . , u_m⟩, w ≠ w*. Let us write w as w = (α_1, . . . , α_m, w_1, . . . , w_n). Then σ := α_1 σ_1 + ··· + α_m σ_m is a valid signature of w, since the σ_i are by definition valid signatures of the u_i. Then,

    σ* = s_1 α_1 + ··· + s_m α_m + s_{m+1} w*_1 + ··· + s_{m+n} w*_n,

and

    σ = s_1 α_1 + ··· + s_m α_m + s_{m+1} w_1 + ··· + s_{m+n} w_n.

Hence,

    σ* − σ = Σ_{i=1}^{n} s_{m+i} (w*_i − w_i).

By definition, s_{m+i} = x r_i + s'_i. So,

    σ* − σ − Σ_{i=1}^{n} s'_i (w*_i − w_i) = x · Σ_{i=1}^{n} r_i (w*_i − w_i).

Therefore, unless Σ_{i=1}^{n} r_i (w*_i − w_i) = 0, B may recover x and solve the discrete logarithm problem. Define w̄_i := w*_i − w_i. Then w̄_i ≠ 0 for some i. Let us estimate the probability that Σ_{i=1}^{n} r_i w̄_i = 0. Since the distribution of the y_i is independent of the r_i, we may as well assume that (r_1, . . . , r_n) is chosen after A's choice of the nonzero vector (w̄_1, . . . , w̄_n).
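The final extraction step of this reduction can be checked numerically with exponent arithmetic alone (toy sizes; a prime modulus stands in for the group order, and all variable names are ours):

```python
import random

random.seed(2)
q = (1 << 61) - 1                      # prime standing in for the group order
n = 6
x = random.randrange(1, q)             # the discrete log B wants to find
r = [1] + [random.randrange(2) for _ in range(n - 1)]   # r_i in {0, 1}
sp = [random.randrange(q) for _ in range(n)]            # s'_i
s = [(x * ri + spi) % q for ri, spi in zip(r, sp)]      # s_{m+i} = x r_i + s'_i

# A successful forgery yields sigma* on w* while sigma is the honest
# signature on w; the alpha-parts are taken as already cancelled here.
w = [random.randrange(q) for _ in range(n)]
w_star = [random.randrange(q) for _ in range(n)]
sigma = sum(si * wi for si, wi in zip(s, w)) % q
sigma_star = sum(si * wi for si, wi in zip(s, w_star)) % q

wbar = [(a - b) % q for a, b in zip(w_star, w)]          # w*_i - w_i
denom = sum(ri * wi for ri, wi in zip(r, wbar)) % q      # sum r_i wbar_i
num = (sigma_star - sigma
       - sum(spi * wi for spi, wi in zip(sp, wbar))) % q
assert denom != 0                         # fails only with negligible probability
assert num * pow(denom, -1, q) % q == x   # B recovers the discrete log
```

The recovery works because σ* − σ − Σ s'_i w̄_i = x · Σ r_i w̄_i, so dividing by the (nonzero) coefficient isolates x.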

Lemma 1. Let F be any field. For b = 0 or 1, let R_n^b(k) be the minimum of dim_F ⟨J⟩ over all J ⊆ {0, 1}^n × {b} ⊆ F^{n+1} with |J| = k. We also define μ^b(n) as

    μ^b(n) := min{ k | R_n^b(k) = n + b }.

Then, we have μ^0(n) = μ^1(n) = 2^{n−1} + 1.

Lemma 1 can be proven easily by induction. From Lemma 1, Lemma 2 follows:

Lemma 2. Let F be any field, and let (v_1, . . . , v_n) ∈ F^n be any nonzero vector. If we pick r_1, . . . , r_n ←$ {0, 1} ⊆ F uniformly and independently, then

    Pr[ r_1 v_1 + ··· + r_n v_n = 0 ] ≤ 1/2.

Proof. Suppose that Pr[ r_1 v_1 + ··· + r_n v_n = 0 ] > 1/2. This means that |J| ≥ 2^{n−1} + 1, where

    J := { (r_1, . . . , r_n) ∈ {0, 1}^n | r_1 v_1 + ··· + r_n v_n = 0 }.

But according to Lemma 1, since μ^0(n) = 2^{n−1} + 1, J must span the full space F^n. This contradicts the assumption that (v_1, . . . , v_n) is nonzero.
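Lemma 2 can also be checked exhaustively for small cases (a brute-force sanity check over a small prime field; the field size and dimension are arbitrary):

```python
from itertools import product
import random

random.seed(0)
q = 7   # small prime field F_q, so all 2**n choices can be enumerated
n = 4
for _ in range(50):
    v = [random.randrange(q) for _ in range(n)]
    if not any(v):
        continue  # the lemma only covers nonzero vectors
    zeros = sum(1 for r in product((0, 1), repeat=n)
                if sum(ri * vi for ri, vi in zip(r, v)) % q == 0)
    # At most half of the 2**n choices of (r_1, ..., r_n) can give 0.
    assert zeros <= 2 ** (n - 1)
```

The bound is tight: for v = (1, 0, 0, 0) exactly the 2^{n−1} tuples with r_1 = 0 give a zero sum.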

From Lemma 2, we see that if A succeeds in making a forgery with probability ε, then B may solve the discrete logarithm problem with probability at least ε/2.

A.2 Batch Verification

In order to reduce the cost of verification, we consider batch verification. With this approach, k received blocks are verified together instead of individually. We again assume that an attacker of the homomorphic signature may obtain a signature of any vector of the subspace ⟨u_1, . . . , u_m⟩ ⊆ F_p^{m+n} spanned by the augmented vectors u_1, . . . , u_m of the file which the source wants to send via the network. Let p' be the smallest prime factor of φ(N)/4. Given a signature σ_j for a block w_j = (w_1^{(j)}, . . . , w_{n+m}^{(j)}) (1 ≤ j ≤ k), the verification equation is

    g^{−σ_j} ∏_{i=1}^{n+m} g_i^{w_i^{(j)}} ≡ 1 (mod N).

We select k random positive integers r_1, . . . , r_k less than p', raise the j-th equation to the power r_j, and multiply all of them together for 1 ≤ j ≤ k. Then we get the following equation:

    g^{−Σ_{j=1}^{k} r_j σ_j} ∏_{i=1}^{n+m} g_i^{Σ_{j=1}^{k} r_j w_i^{(j)}} ≡ 1 (mod N).   (A.1)

We claim that, if equation (A.1) holds, then the probability that one of the signatures is wrong is at most 1/p'. This is shown by the following lemma, using a technique similar to the one in [54, Theorem 1].

Lemma 3. Let S be the set of positive integers less than p'. Then the probability that a batch of signatures containing at least one wrong signature passes the test is at most 1/p'.

Proof. The equation (A.1) is equivalent to

    Σ_{j=1}^{k} r_j ( −σ_j + Σ_{i=1}^{n+m} w_i^{(j)} s_i ) ≡ 0 (mod φ(N)/4)   (A.2)

because the order of g is φ(N)/4. If the signatures in a batch are fixed, the equation

(A.2) can be considered as a linear equation a_1 r_1 + ··· + a_k r_k ≡ 0 (mod φ(N)/4) in the k variables r_1, . . . , r_k over Z_{φ(N)/4}, where a_j = −σ_j + Σ_{i=1}^{n+m} w_i^{(j)} s_i.

If one of the signatures, say σ_1, is wrong, then a_1 ≢ 0 (mod φ(N)/4). Then there is at least one prime p̂ dividing φ(N)/4 such that a_1 ≢ 0 (mod p̂).

For each (k − 1)-tuple (r_2, . . . , r_k), there is at most one r_1 < p̂ satisfying equation (A.2) modulo p̂. Hence the fraction of k-tuples (r_1, . . . , r_k) satisfying equation (A.2) is at most 1/p̂. If we select each r_j randomly from S, the probability that the k-tuple (r_1, . . . , r_k) satisfies equation (A.2) is therefore at most 1/p̂. Since p' ≤ p̂, we have 1/p̂ ≤ 1/p', which completes the proof.

If p − 1 and q − 1 are B-smooth, the size of p' can be only several bits smaller than that of B. Then the failure probability 1/p' is small enough for network coding; for example, one failure out of one billion verifications is acceptable.