Herry, H., Band, E., Perkins, C. and Singer, J. (2018) Peer-to-Peer Secure Updates for Heterogeneous Edge Devices. In: DOMINO Workshop at the IEEE/IFIP Network Operations and Management Symposium (NOMS 2018), Taipei, Taiwan, 23-27 Apr 2018, ISBN 9781538634165 (doi:10.1109/NOMS.2018.8406323)

This is the author’s final accepted version.

There may be differences between this version and the published version. You are advised to consult the publisher’s version if you wish to cite from it.

http://eprints.gla.ac.uk/159076/

Deposited on: 16 March 2018

Enlighten – Research publications by members of the University of Glasgow http://eprints.gla.ac.uk Peer-to-Peer Secure Updates for Heterogeneous Edge Devices

Herry Herry∗, Emily Band†, Colin Perkins‡, and Jeremy Singer∗ School of Computing Science, University of Glasgow, Glasgow G12 8QQ, United Kingdom ∗{herry.herry, jeremy.singer}@glasgow.ac.uk, †[email protected], ‡[email protected]

Abstract—We consider the problem of securely distributing address these concerns, but come with significant financial software updates to large scale clusters of heterogeneous edge and complexity costs. compute nodes. Such nodes are needed to support the of In this paper, we consider an alternative to such centralised Things and low-latency edge compute scenarios, but are difficult to manage and update because they exist at the edge of the systems, and explore whether peer-to-peer overlay protocols network behind NATs and firewalls that limit connectivity, or can enable secure, scalable, and cost-effective system updates because they are mobile and have intermittent network access. for large-scale networks of heterogeneous edge devices. We We present a prototype secure update architecture for these consider how to build an overlay in networks with limited devices that uses the combination of peer-to-peer protocols and connectivity, such as those behind network address translator automated NAT traversal techniques. This demonstrates that edge devices can be managed in an environment subject to partial or (NAT) devices, restricted firewalls, or with intermittent con- intermittent network connectivity, where there is not necessarily nections to the Internet, and how to use this to deploy updates direct access from a management node to the devices being in a scaleable manner. updated. To support this, we have designed and built a prototype decentralised management framework that distributes system I. Introduction updates and management scripts via a peer-to-peer overlay. This allows us to manage devices that do not have direct Deployment of smart campuses and smart cities use low- network access such as nodes behind network address trans- cost, low-power edge compute devices to build sensor and lators (NATs) or firewalls. Also, the nodes in local network control systems, and to provide smart edge compute nodes. could share the same update files, saving the bandwidth to Such systems often start small, with a few tens of nodes, but the external network. Moreover, we may not need a dedicated deployments can rapidly scale to many thousands of devices, update server since every node can provide the updates for the and future systems can be expected to be still larger scale. The others. Our framework combines several key techniques: devices that comprise these systems can be in inaccessible 1) STUN-based UDP hole punching to discover and open locations, be mobile, or be in private residences. Remote NAT bindings; administration is essential to ensure that security updates are 2) a gossip protocol to deliver short messages, similar in applied, and to install new applications and services. scope to Serf[4], to distribute update notifications; and Standard cluster management tools, such as Puppet [1], Chef 3) to securely distribute the software updates. [2], and Ansible [3], have a track record of allowing ad- ministrators to successfully manage large numbers of devices. Our contribution is to demonstrate that clusters of edge devices However, these tools rely on either direct push network access can be managed in an environment subject to partial or from the central management systems to the devices being intermittent network connectivity, and in the presence of NATs managed, or on having the devices have access to a well- and firewalls, where there is not necessarily direct access from connected central server from which they can pull updates. a management node to the devices being updated. Such cluster management tools work well for always-on nodes The remainder of this paper is organized as follows. We in well-connected networks, such as data centres, but are begin, in Section II, but outlining the challenges inherent in not suitable for more the distributed, mobile, ad-hoc, and building a peer-to-peer update protocol that can operate in otherwise poorly-connected environments we consider. today’s challenged network environment. We describe our pro- posed system architecture in section III, and give an example Additionally, relying on a central management server to of its operation in in Section IV. Related work is discussed in provide updates, whether push from that server to edge nodes section V. Finally, section VI concludes and considers future or pulling updates from the server, introduces robustness work. and scalability concerns. The central server is a single point of failure, vulnerable to hardware and software failures and II. Challenges in Peer-to-Peer Cluster Updates malicious attacks, and will likely become a performance The deployment and subsequent need to manage large-scale bottleneck as the system grows. The use of replicated servers, clusters of heterogeneous edge devices forces us to consider content distribution networks, and similar mechanisms can issues that do not occur in traditional data centre networks. These devices lack a standard device hardware configuration; tend to be relatively underpowered, and may not be able to Update Agent run heavy-weight management tools due to performance or Manager power constraints. In addition, they tend to have only limited Deployment BitTorrentBTclient network access due to the presence of firewalls, NATs, or NAT-TM (Transmission)Client intermittent connectivity, and so cannot be assumed to be Puppet Ansible Shell directly accessible by a management system. OperatingOperatingOperating System System System The issue of limited network connectivity is a significant challenge for the management and secure updating infrastruc- Fig. 1: The architecture of Peer-to-Peer Secure Update ture. Traditional data centres are built around the assumption framework. that hosts being managed are either directly accessible to the management system, or they can directly access the manage- ment system, and that the network is generally reliable. This 1) it can receive update notifications and download the is increasingly not be the case for the devices we consider. update file from at least one other peer; There are several reasons for this: 2) it can relay notifications/share updates with other peers; 1) devices will be installed in independently operated res- 3) both (1) and (2) should work even though the peer is idential or commercial networks at the edge of the behind NAT or firewall; Internet, and so will be subject to the security policies 4) it can independently deploy updates on local node; and of those networks, and will be protected by the firewalls 5) it can verify integrity of notifications and updates to enforcing those policies. avoid malicious activity, e.g., man-in-the-middle attacks. 2) devices located in independently operated networks may Driven by the above requirements, we design a framework be in different addressing realms, and traffic may need to whose architecture is depicted in figure 1. The framework uses pass through a NAT between the device being managed an agent, that runs on every node, which consists of four and the management system. components: manager, NAT traversal messaging (NAT-TM), 3) devices may not always be connected to the Internet, BitTorrent client, and deployment. The design is mainly tar- perhaps because they are mobile, power constrained, or getting a system that runs a Linux operation system, although otherwise have limited access. it can be adapted to other platforms. The manager is a component that receives an update notifi- Traditional cluster management tools fail in these environ- cation in the form of torrent-file either from an administrator ments since they do not account for NAT devices and par- or other agents. It uses a public-key, which is installed on tial connectivity, and frequently do not consider intermittent every node, in order to verify the integrity of torrent-file. It connectivity. Management and updating tools need to evolve to triggers the download by passing the torrent-file to BitTorrent allow node management via indirect, peer-to-peer, connections client, and calls deployment component to apply the update that traverse firewalls that prohibit direct connection to the once the download has finished. It also relays the notification system being managed. They must also incorporate automatic to its peers using a peer-to-peer overlay network provided by NAT traversal, building on protocols such as STUN and ICE NAT-TM. [5], [6] to allow NAT hole-punching and node access without The NAT Traversal Messaging component (NAT-TM) pro- manual NAT configuration1. vides a peer-to-peer overlay network which uses a gossip Existing cluster management tools scale by assuming a protocol to distribute a message to its peer, in scope similar restricted, largely homogeneous, network environment, but this to the one that is implemented in Serf [4]. It uses the UDP is not scenario we envisage for sensor and control networks, or hole punching via the STUN protocol [8], [5] to enable bi- edge computing. Such systems will increasingly be deployed directional communication with nodes behind NATs. in the wild, at the edge of the network, where the network The BitTorrent client receives a torrent-file from manager performance and configuration is not predictable. Update and and uses it to download the update file from its peers. Our management tools must become smart about maintaining con- prototype is using an off-the-shelf BitTorrent client i.e. Trans- nectivity in the face of these difficulties to manage nodes to mission [9] which supports Distributed Hash Table (DHT) which there’s no direct access. In the remainder of this paper, [10] that enables trackerless network. Manager monitors the we outline a prototype system that aims to begin to address download status from the log file generated by Transmission. these challenges. Finally, the deployment component receives the update file manager III. System Architecture and the target resource identifier from . It then invokes a provisioning tool such as Puppet or just simply a shell script Our peer-to-peer secure update framework requires every to apply the update. node to have the following capabilities: The standard format of torrent-file contains trackers’ address and files’ information, where each file’s information consists 1 Existing peer-to-peer management tools, for example APT-P2P [7] and of filename, file-length, piece-length, and cryptographic-hash HashiCorp Serf [4], require NAT/firewall pinholes to be manually configured, to allow access by the management tools, limiting their applicability to of file pieces [11]. However, there are some important in- environments with system administration support. formation that are missing from this standard, which are required by our framework to work properly, including: 1) • (Figure 2b) The agent of node 2 performs a UDP hole digital signature, for integrity checking; 2) resource identifier, punching by sending a message to the STUN server which distinguishes two different target resources that will be through the NAT. This opens a particular port (EPort2) updated; and 3) version, that helps the agent to ignore outdated on the NAT router which is accessible from internet. update which might still exist in the network. Because of this, The STUN server records this information by keeping we enrich the standard format with these new information. the internal IP and port (IP2, Port2) and the external IP For signature, we implement a draft scheme that is proposed and port (EIP2, EPort2) of node 2 in its session table. in [12] which uses a private-key owned by the administrator, Since the router will close EPort2 if there is no activity and a public-key installed on every node, for signing and for a period of time, then the agent periodically sends a verifying the torrent-file. Unfortunately, we are not aware keepalive message to the STUN server to keep the port of any proposal that provides the other two. Thus, we add open. a Uniform Resource Identifier (URI) and a (monotonically • (Figure 2b) For node 3 and 4, the administrator should increasing) version into our new format, which is called as manually reconfigure the firewall to open a particular port torrent-file++. allowing incoming UDP packets to a trusted node, which Our current design considers two main threats: 1) man- in this case is node 3. While node 4 will use node 3 as in-the-middle attack, and 2) re-distribution of an old, poten- a bridge. The agent of node 3 sends a message to the tially vulnerable, version of the software. The agent mitigates STUN server that allows its internal IP/port (IP3, Port3) the former by discarding any torrent-file++ that fail the and external IP/port (EIP3, EPort3) to be recorded in the digital signature verification. For the latter, the agent uses session table. the URI to retreive the last update version (from its local • (Figure 2c) The STUN server sends its session table to database) that had been applied before, and only accepts the node 1, 2 and 3 so that each node knows which IP/port torrent-file++ with newer version. that should be used to communicate with the others. This We choose BitTorrent because it has been widely proven activity is done periodically in order to propagate any to be reliable to securely distributing files. The file integrity changes on the session table whenever some nodes joins can be verified using cryptographic-hash of file-pieces. Popular or leaves the cluster. BitTorrent clients, such as Transmission, implement NAT Port • (Figure 2d) After the agent of node 1 knows the external Mapping protocol (NAT-PMP) [13] to work with NATs. We IP/port of node 2 and 3, then it sends the torrent-file also can limit the network usage of the BitTorrent clients for to these nodes as a notification that a new update is low bandwidth environments. available. The agents of node 2 and 3 then verify the integrity of torrent-file using their public-keys. If it is IV. Example verified, then the agent of node 3 sends the torrent-file to node 4, which will also verify it using its public-key. This section gives an example that illustrates the operations • (Figure 2e) Since the agents of node 2, 3, and 4 already of our peer-to-peer secure update framework. have the torrent-file, they could start downloading the Assume we have a 4-nodes system as illustrated in figure update-file by submitting the torrent-file to the BitTorrent 2a, where: node 1 is fully connected to the internet; node 2 client. is in a private network connected to the internet via a NAT; • (Figure 2f) When the download has finished, the agents and node 3 and 4 are in a private network connected to the can deploy the update on their local node by invoking internet via a restricted firewall which does not allow any necessary management scripts. communication initiated from outside. Each node runs an agent whose architecture is depicted in figure 1. V. Related Work As administrator, we would like to apply an update on all nodes to fix a security vulnerability. The process of Mainstream OS update services feature peer-to-peer con- deploying this update using our peer-to-peer framework can tent distribution. For instance, the Windows 10 ‘Delivery be summarised as follows: Optimization’ system [14] provides a peer-based distributed • (Figure 2a) The administrator creates a torrent-file from cache, as of late 2016, to mitigate bandwidth issues for large the update-file, signs it using a private-key, and uploads updates [15]. The Debian package management tool has an both files into node 1. Note that the files could be integrated peer-to-peer download facility called apt-p2p [7], uploaded to any node. [16]. It fetches update files using BitTorrent, via a transparent • (Figure 2b) The agent of node 1 verifies the integrity of http proxy. Package validation requires cryptographic hash torrent-file using its public-key as well as the integrity value comparisons with known values from a trusted server. of update-file using cryptographic hashes of file piece Dale and Liu [16] identify problems with peer-to-peer available in the torrent-file. The update will be deleted if updating, such as long wait times and different users requiring the verification fails. Otherwise, the agent deploys it on different subsets of available applications. Zhang et al. [17] local node and then passes the files to a BitTorrent client present optimizations to the underlying Kademlia distributed for sharing with other nodes. hash table (DHT) algorithm to improve the file-piece search EIP3:EPort3,IP3:Port3

Node 2 EIP2:EPort2,IP2:Port2 Node 2 Node 2 EIP3:EPort3,IP3:Port3 Node 1 Private Network Private Node 1 Private Network Node 1 Network NAT (EIP2:EPort2) NAT NAT STUN (EIP2:EPort2) Internet Server STUN Internet STUN EIP2:EPort2,IP2:Port2 Server Internet Server EIP3:EPort3,IP3:Port3 Firewall EIP2:EPort2,IP2:Port2 (EIP3:EPort3) Firewall EIP3:EPort3,IP3:Port3 Firewall (EIP3:EPort3) Private Network Private Private Network Network IP3:Port3 Node 4 Node 3 IP3:Port3 IP3:Port3 EIP2:EPort2,IP2:Port2 Node 4 Node 3 Node 4 Node 3 IP4:Port4 (a) (b) (c)

EIP3:EPort3,IP3:Port3 EIP3:EPort3,IP3:Port3 EIP3:EPort3,IP3:Port3

Node 2 Node 2 Node 2 EIP2:EPort2,IP2:Port2 EIP2:EPort2,IP2:Port2 EIP2:EPort2,IP2:Port2 EIP3:EPort3,IP3:Port3 EIP3:EPort3,IP3:Port3 EIP3:EPort3,IP3:Port3 Private Node 1 Private Node 1 Private Node 1 Network Network Network

NAT NAT NAT (EIP2:EPort2) (EIP2:EPort2) (EIP2:EPort2)

STUN STUN STUN Internet Internet Internet Server Server Server EIP2:EPort2,IP2:Port2 EIP2:EPort2,IP2:Port2 EIP2:EPort2,IP2:Port2 EIP3:EPort3,IP3:Port3 Firewall EIP3:EPort3,IP3:Port3 Firewall EIP3:EPort3,IP3:Port3 Firewall (EIP3:EPort3) (EIP3:EPort3) (EIP3:EPort3)

Private Private Private Network Network Network

IP3:Port3 IP3:Port3 IP3:Port3 Node 4 Node 3 Node 4 Node 3 Node 4 Node 3

EIP2:EPort2,IP2:Port2 EIP2:EPort2,IP2:Port2 EIP2:EPort2,IP2:Port2 IP4:Port4 IP4:Port4 IP4:Port4 (d) (e) (f) node updated-node torrent-file update-file Fig. 2: An example update process of a system where some nodes are behind a NAT (node 2) or a firewall (node 3 and 4). IPx, Portx, EIPx and EPortx are the internal IP address, the internal port, the external IP address, and the external port of the agent on nodex, respectively. performance. However these problems are not relevant to the Our work involves moving configuration management to a challenges mentioned in section II. more heterogeneous, distributed network, while still maintain- Other system-level peer-to-peer distribution services include ing some compatibility with existing solutions. For instance, peer-npm [18] which handles JavaScript package management. we support Puppet updates initiated via our overlay network. This is a modular system, so it can interface with multiple One early suggestion for updating a network of sensor distributed file system backends to retrieve packages and nodes involves assigning a random slot time for each node metadata. to download the update from a central server [20]. BitTorrent, torrent-tracker, and distributed hash table (DHT) implementations are all well-known aspects of peer-to-peer VI. Conclusion and Future Works distribution networks. Our system relies on these openly available technologies. We have developed a peer-to-peer secure update framework Serf [4] is a gossip-based protocol [19] that enables de- and proved that its prototype can update securely the system centralized service discovery and orchestration in a cluster of with partial network connectivity, and works in the presence nodes. It supports an event-based system, effectively enabling of NATs or firewalls. a peer-to-peer publish/subscribe model. Our system uses a Future work will integrate this peer-to-peer secure update 2 similar protocol to inform IoT nodes of configuration change framework into the FRµIT testbed to enhance scalability, and events. allow us to manage hosts in challenged network environments. Puppet [1], Chef [2], and Ansible [3] are industry-standard 2 configuration management tools, provisioning software and The Federated RaspberryPi µ-Infrastructure Testbed (https://fruit-testbed. org) is a testbed of federated, geo-distributed micro-datacenters. The project services on remote nodes. These tools generally require contin- aims to aggregate low-cost, low-power, commodity infrastructure to form an uous network connectivity for master/slave style orchestration. efficient and effective compute fabric for key distributed applications. Acknowledgements [11] B. Cohen, “The BitTorrent protocol specification,” http://www.bittorrent. org/beps/bep_0003.html, January 2008. This work is funded by UK Engineering and Physical [12] C. Brown, “Torrent Signing,” http://www.bittorrent.org/beps/bep_0035. Sciences Research Council under grant EP/P004024/1. html, July 2012. [13] S. Cheshire and M. Krochmal, “NAT Port Mapping Protocol (NAT- References PMP),” IETF, April 2013, RFC 6886. [1] V. Hendrix, D. Benjamin, and Y. Yao, “Scientific Cluster Deployment [14] Microsoft, “Windows update delivery optimization and Recovery - Using Puppet to simplify cluster management,” Journal and privacy,” https://privacy.microsoft.com/en-US/ of Physics: Conference Series, vol. 396, no. 4, Dec. 2012. windows-10-windows-update-delivery-optimization, 2017, Accessed: [2] M. Marschall, Chef Infrastructure Automation Cookbook. Packt Pub- 2018-01-27. lishing, 2013. [15] D. Halfin et al., “Deploy updates using Windows Update for [3] M. Mohaan and R. Raithatha, Learning Ansible : use Ansible to Business,” https://docs.microsoft.com/en-gb/windows/deployment/ configure your systems, deploy software, and orchestrate advanced IT update/waas-manage-updates-wufb, 2017. tasks. Packt Publishing, 2014. [16] C. Dale and J. Liu, “apt-p2p: A peer-to-peer distribution system for [4] HashiCorp, “Serf: Decentralized cluster membership, failure detection, software package releases and updates,” in IEEE INFOCOM 2009, 2009, and orchestration,” https://www.serf.io/, 2017, Accessed: 2017-12-20. pp. 864–872. [5] J. Rosenberg, R. Mahy, P. Matthews, and D. Wing, “Session traversal [17] Q. Zhang, J. Yu, L. Luo, J. Ma, Q. Wu, and S. Li, “An optimized dht for utilities for NAT (STUN),” IETF, October 2008, RFC 5389. linux package distribution,” in 15th International Symposium on Parallel [6] J. Rosenberg, “ICE: A protocol for NAT traversal for offer/answer and Distributed Computing (ISPDC), 2016, pp. 298–305. protocols,” IETF, April 2010, RFC 5245. [18] S. Whitmore, “peer-npm,” https://www.npmjs.com/package/peer-npm, [7] C. Dale, “apt-p2p – apt helper for peer-to-peer downloads of 2017, accessed: 2018-01-27. debian packages,” http://manpages.ubuntu.com/manpages/zesty/man8/ [19] P. T. Eugster, R. Guerraoui, A. M. Kermarrec, and L. Massoulie, apt-p2p.8.html, 2017, Accessed: 2017-12-20. “Epidemic information dissemination in distributed systems,” Computer, [8] B. Ford, P. Srisuresh, and D. Kegel, “Peer-to-peer communication vol. 37, no. 5, pp. 60–67, 2004. across network address translators.” in Proc. USENIX Annual Technical [20] G. Pollock, D. Thompson, J. Sventek, and P. Goldsack, “The Conference, 2005. asymptotic configuration of application components in a distributed [9] “Transmission,” https://transmissionbt.com, Accessed: 2018-01-25. system,” University of Glasgow, Technical Report, 1998. [Online]. [10] A. Loewenstern and A. Norberg, “DHT Protocol,” http://www.bittorrent. Available: http://eprints.gla.ac.uk/79048/ org/beps/bep_0005.html, January 2008.