Kalim U D 2017.Pdf (10.97Mb)

Cognizant Networks: A Model and Framework for Session-based Communications and Adaptive Networking

Umar Kalim

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science and Applications

Wu-chun Feng, Chair
Pavan Balaji
Mark Gardner
Eli Tilevich
Yaling Yang

June 30, 2017
Blacksburg, Virginia

Keywords: Session Management, Context Awareness, Dynamic Network Configuration, Network-Stack Extensions, Next-Generation Networking

Copyright 2017, Umar Kalim

ABSTRACT

The Internet has made tremendous progress since its inception. The kingpin has been the transmission control protocol (TCP), which supports a large fraction of communication. With the Internet's widespread access, users now have increased expectations. The demands have evolved to an extent which TCP was never designed to support. Since network stacks do not provide the necessary functionality for modern applications, developers are forced to implement it over and over again — as part of the application or supporting libraries. Consequently, application developers not only bear the burden of developing application features but are also responsible for building networking libraries to support sophisticated scenarios. This leads to considerable duplication of effort.

The challenge for TCP in supporting modern use cases is mostly due to limiting assumptions, simplistic communication abstractions, and (once expedient) implementation shortcuts. To further add to the complexity, the limited TCP options space is insufficient to support extensibility and thus contemporary communication patterns. Some argue that radical changes are required to extend the network's functionality; some researchers believe that a clean-slate approach is the only path forward. Others suggest that evolution of the network stack is necessary to ensure wider adoption — by avoiding a flag day. In either case, we see that the proposed solutions have not been adopted by the community at large. This is perhaps because the cost of transition from the incumbent to the new technology outweighs the value offered. In some cases, the limited scope of the proposed solutions limits their value. In other cases, the lack of backward compatibility or significant porting effort precludes incremental adoption altogether.

In this dissertation, we focus on the development of a communication model that explicitly acknowledges the context of the conversation and describes (much of) modern communications.
We highlight how the communication stack should be able to discover, interact with, and use available resources to compose richer communication constructs. The model is able to do so by using session, flow, and endpoint abstractions to describe communications between two or more endpoints. These abstractions provide application developers the means to set up and manipulate constructs, while the ability to recognize change in the operating context and reconfigure the constructs allows applications to adapt to changing requirements. The model considers two or more participants to be involved in the conversation and thus enables most modern communication patterns, which is in contrast with the well-established two-participant model.

Our contributions also include an implementation of a framework that realizes such communication methods and enables future innovation. We substantiate our claims by demonstrating case studies where we use the proposed abstractions to highlight the gains. We also show how the proposed model may be implemented in a backward-compatible manner, such that it does not break legacy applications, network stacks, or middleboxes in the network infrastructure. We also present use cases to substantiate our claims about backward compatibility. This establishes that incremental evolution is possible.

We highlight the benefits of context awareness in setting up complex communication constructs by presenting use cases and their evaluation. Finally, we show how the communication model may open the door for new and richer communication patterns.

This work is supported in part by Juniper Networks and Virginia Tech.

GENERAL AUDIENCE ABSTRACT

In this dissertation, we focus on the development of a communication model that explicitly acknowledges the context of the conversation and describes (much of) modern communications. We highlight how the networking software should be able to discover, interact with, and use available resources. The model is able to do so by using abstractions that describe communications between participants as if human beings were having a conversation, i.e., the semantics of interactions between participants are defined in terms of a conversation session. These abstractions provide application developers the means to describe communications in a holistic manner, recognize changes in the context, and reconfigure communications to adapt to changing requirements. The model considers two or more participants to be involved in the conversation and thus enables most modern communication patterns, which is in contrast with the well-established two-participant legacy model.

Our contributions also include an implementation of a framework that realizes such communication methods and enables future innovation. We substantiate our claims by demonstrating case studies where we use the proposed abstractions to highlight the gains. We also show how the proposed model may be implemented in a backward-compatible manner, such that it does not break legacy applications, networking software, or network infrastructure. We also present use cases to substantiate our claims about backward compatibility. This establishes that incremental evolution is possible.

We highlight the benefits of context awareness in setting up complex communication constructs by presenting use cases and their evaluation. Finally, we show how the communication model may open the door for new and richer communication patterns.

This work is supported in part by Juniper Networks and Virginia Tech.

To

my parents, Ataullah and Ghazala,

my wife, Mariam,

my sisters, Umairah and Faria,

and my brother, Zubair, for their unconditional love and support.

ACKNOWLEDGEMENTS

I thank Almighty Allah, who blessed me with the ability to pursue graduate studies and persevere. Alhamdulillah (all praise and thanks is for Allah). Many people have contributed in different ways towards the successful completion of my graduate studies. I am grateful to everyone involved, for all manner of support.

My experience as a doctoral student has been intensive and rewarding. If there were one reason that enabled me to persevere, it would be the unconditional support and encouragement of my family. My parents, Ataullah and Ghazala, always reassured me that by the grace of Allah I can surmount all obstacles; they instilled in me that with tenacity I can persevere through all challenges. My regular conversations with them allowed me to benefit from their wisdom and stay positive in spite of the circumstances. Their continuous prayers, wise counsel, well wishes, and nurture have enabled me to accomplish all that I have in my life.

My wife, Mariam, went above and beyond in enabling me to make continuous and effective progress towards graduation. Her limitless compassion and encouragement allowed me to overcome numerous challenges. I cannot be thankful enough for the unconditional support of my sisters, Umairah and Faria, and the prayers of my brother, Zubair. My son, Ahmed, with his enthusiasm and liveliness, has always reminded me to take a step back from the daily grind and be grateful for the blessings that have been bestowed upon us.

I am indebted to my advisor, Dr. Wu Feng, for his continuous guidance and support. He introduced me to the challenges of scientific research; his patient approach enabled me to learn the ropes, understand the process of research, and develop my academic abilities.
He provided me with ample opportunities to grow and facilitated my participation in various scholarly and scientific activities — e.g., reviewing technical papers, writing and presenting research papers, and volunteering for state-of-the-art setups such as SCinet at the Supercomputing conference. I learned many life lessons by silently observing him and his approach towards challenges, which I hope will allow me to emulate him and be a better human being. His approach of leading by example, such as putting in long hours or going the extra mile, always pushed me to do better.

I am also grateful to my thesis committee, Dr. Pavan Balaji, Dr. Mark Gardner, Dr. Eli Tilevich, and Dr. Yaling Yang, for their valuable comments, feedback, and support.

Dr. Mark Gardner and Eric Brown have been instrumental in my research. This dissertation would not have been possible without their invaluable support, guidance, and patience in bearing with my endless questions. They were always accessible and available; I consider myself lucky to have had access to these wise and compassionate professionals. I cannot be appreciative enough of their support.

The support, encouragement, and sage advice that I received from Dr. Eli Tilevich had a significant impact on my academic development. His compassionate approach and genuine concern about my professional well-being enabled me to make smart decisions. I will never forget his wise words.

I am grateful to my mentor, Dr. Les Cottrell, whose compassion, guidance, encouragement, and appreciation greatly boosted my confidence. His incredible work ethic, attitude of not shying away from challenges, enthusiasm towards responsibilities, and diligent approach in general left a lasting impact on me. I consider myself privileged to have worked alongside him. I would not have been able to pursue doctoral studies without his encouragement and support.
It is impossible to sail through the adventurous times of graduate studies without the support of our colleagues and comrades. I thank all my colleagues at the Synergy Laboratory for making this journey possible. I am particularly grateful to Balaji, Naveed, Arshad, Tozammel, Konstantinos, Abid Ali, Abdul Hafeez, Mustafa, Ashwin, Vignesh, Nataliya, Sarunya (Kwang), and many others. I am also thankful for the encouragement and advice of my colleagues and friends Ali Khayam and Qasim Ali.

I am also grateful to my colleagues at Advanced Research and Computing, particularly Brian Marshall, Dr. James McClure, and Justin Krometis, for their generous support.

My graduate studies would not have been possible without the support of Juniper Networks and Advanced Research and Computing; their financial support through Graduate Research Assistantships allowed me to continue my education at Virginia Tech. I am also grateful to the Department of Computer Science and the Graduate School at Virginia Tech for providing me with the administrative support to help complete my graduate studies.

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Research Contributions
  1.4 Outline

2 Related Work
  2.1 Session-Layer Proposals
    2.1.1 TESLA - A Transparent, Extensible Session-Layer Architecture for End-to-End Network Services
    2.1.2 Session Layer Concept for Overlay Networks
    2.1.3 A Session-Based Architecture for Internet Mobility
    2.1.4 Phoebus: A Session Protocol for Dynamic and Heterogeneous Networks
    2.1.5 Taking Advantage of Multi-homing with Session Layer Striping
    2.1.6 Open Systems Interconnection (OSI) Model
    2.1.7 Session-Initiation Protocol
  2.2 Transport-Layer Proposals
    2.2.1 Structured Stream Transport
    2.2.2 TNG: Transport Next Generation
    2.2.3 Stream Control Transmission Protocol
    2.2.4 Multipath TCP
  2.3 Network-Stack Extensions

    2.3.1 SERVAL: An End-host Stack for Service-centric Networking
    2.3.2 Congestion Manager
    2.3.3 Mobile IP (v4 and v6)
    2.3.4 MSOCKS - An Architecture for Transport Layer Mobility
    2.3.5 Host Identity Protocol
    2.3.6 Middlebox Communication (MIDCOM) Protocol Semantics
  2.4 Clean-Slate Designs
    2.4.1 Networking is IPC: A guiding principle to a better Internet
    2.4.2 Internet Indirection Infrastructure
  2.5 Discussion

3 Session-Based Communication Model Enabling Context-Awareness
  3.1 Conflation of Session and Transport Semantics in TCP
  3.2 Session-Layer Abstractions
  3.3 SLIM’s Architecture
    3.3.1 Session Management
    3.3.2 Negotiation of Configuration
    3.3.3 Services
    3.3.4 Session vs. Transport Semantics
  3.4 Communication Patterns
    3.4.1 Client Server
    3.4.2 Peer to Peer
    3.4.3 Publish Subscribe
    3.4.4 Broadcast
    3.4.5 Survey
    3.4.6 Pipeline
  3.5 Prototype Implementation
    3.5.1 Session State
    3.5.2 Data Flows

    3.5.3 Flow Labels and Greater Functionality
    3.5.4 Structure of Flows
    3.5.5 Flow-to-Transport Mappings
    3.5.6 Control Flows
    3.5.7 Session Labels and Registry
    3.5.8 Support for Legacy Applications
    3.5.9 Support for Mobility, Migration, and Resilient Communications
  3.6 Discussion
    3.6.1 Separation of Session and Transport Semantics via Session-Based Abstractions
    3.6.2 Enabling Greater Functionality
    3.6.3 Enabling Innovation and Extensibility
    3.6.4 Backward Compatibility and Adoption
    3.6.5 In the Presence of Middleboxes
    3.6.6 Applying Lessons Learned to Non-TCP Transport
    3.6.7 Development Effort — Cost vs. Value
    3.6.8 User Space vs. Kernel
    3.6.9 Security Considerations
    3.6.10 Performance Evaluation
  3.7 Applying Pipelining to TCP for Efficient Communication over Wide-Area Networks: A Case Study Exemplifying Benefits of Context Awareness
    3.7.1 Background
    3.7.2 Analytical Model and Cascaded TCP
    3.7.3 Experimental Setup
    3.7.4 Results
    3.7.5 Discussion
    3.7.6 Conclusions
  3.8 Summary of Session-Based Communication Model

4 Enabling Extensions to the Network Stack
  4.1 Approach Towards TCP Extensions
  4.2 Proposed Solution
    4.2.1 Concept and Semantics
    4.2.2 The Wire Protocol
  4.3 Discussion
    4.3.1 TCP Option Space
    4.3.2 Incompatible Options
    4.3.3 Performance
    4.3.4 Simplicity
    4.3.5 SYN Cookies
    4.3.6 Middleboxes
    4.3.7 Security
    4.3.8 Application Compatibility
  4.4 Case Study Exemplifying TCP Extensions: Virtual Machine Migration Beyond Subnets
    4.4.1 Background
    4.4.2 Existing Approaches of VM Migration
    4.4.3 Challenges for VM Migration
    4.4.4 Methodology
    4.4.5 Discussion & Evaluation
    4.4.6 Summary of Case Study
  4.5 Case Study Exemplifying TCP Extensions: Resilience in the Presence of Middleboxes
    4.5.1 Conceptual Design
    4.5.2 Extending TCP
    4.5.3 Implementation and Evaluation
    4.5.4 Alternate Methods to Manage Communications Involving Middleboxes
    4.5.5 Summary of Case Study

  4.6 Summary

5 Enabling New Communications Paradigms
  5.1 Middleboxes Inferring Application State versus Being First-Class Citizens
    5.1.1 Examples of Interaction with Middleboxes
  5.2 Explicit Interaction with Middleboxes
    5.2.1 Generic Interactions with Middleboxes
    5.2.2 Key Insight
    5.2.3 Typical Workflow
    5.2.4 Classification of Messages
    5.2.5 Verb Templates
    5.2.6 Towards Incremental Adoption
  5.3 Interactions with Firewalls
    5.3.1 Design Considerations
    5.3.2 Typical Workflow of Explicit Interactions with Firewalls
    5.3.3 Protocol and Semantics
    5.3.4 State Machine
    5.3.5 Implementation Considerations
    5.3.6 Cascaded Access
    5.3.7 Backward Compatibility
    5.3.8 Policy Enforcement and Access Control
  5.4 SLIM Extensions in Specialized Domains
    5.4.1 Background
    5.4.2 Characterization of MPI-Related Faults
    5.4.3 Approaches Towards MPI Resilience
    5.4.4 SLIM’s Integration with MPI
    5.4.5 Prototype Implementation
    5.4.6 Discussion
    5.4.7 Related Work

    5.4.8 Future Work
  5.5 Summary

6 Summary and Future Work
  6.1 Summary of Dissertation
  6.2 Related Directions of Research
    6.2.1 Cross-Layer Communication
    6.2.2 Expanding New Communication Paradigms
    6.2.3 Policy Management and Enforcement
    6.2.4 SLIM’s Application in Specialized Domains

Bibliography

List of Figures

1.1 An illustration of typical network stack configurations, highlighting some design choices for implementing supporting network libraries.
1.2 A simplified illustration of relationships between the application and session layer, along with the rest of the stack. The session layer aims to implement some of the common features that developers implement as supporting libraries or look for in third-party implementations.
1.3 Summary of PhD dissertation research contributions.

2.1 Two TESLA stacks are shown to illustrate how session-layer services may be put together. The stack on the left supports presentation-layer semantics, while the other supports disconnection management. Note: This image has been taken from the authors’ research paper [12].
2.2 The session-layer header used to communicate control information between communicating peers. Note: This image has been taken from the authors’ research paper [19].
2.3 An illustration of the TCP vs. Phoebus model. Note: This image has been taken from the authors’ research paper [23].
2.4 Session-layer striping architecture in which k sessions have the ability to access n network interfaces. Note: This image has been taken from the authors’ research paper [11].
2.5 OSI Reference Model — The ISO Model of Architecture for Open Systems Interconnection. Note: This image has been taken from the authors’ research paper [79].
2.6 An illustration of the SIP session setup with the SIP trapezoid. Note: This image has been taken from RFC 3261 [81].
2.7 An illustration of the structured streams abstraction. Note: This image has been taken from the authors’ research paper [18].

2.8 Separating the application- and network-level semantics of the transport layer. Note: This image has been taken from the authors’ research paper [35].
2.9 An SCTP association. Note: This image has been taken from RFC 4960 [27].
2.10 An illustration of the MPTCP network stack and its comparison with the standard TCP stack. Note: This image has been taken from RFC 6824 [85].
2.11 An illustration of the SERVAL network stack highlighting the separation of service-level data and control. Note: This image has been taken from the authors’ research paper [26].
2.12 The network stack with the Congestion Manager. Note: This image has been taken from the authors’ research paper [33].
2.13 Network topology showing how the mobile node may interact with the proxy while communicating with its peer. This implies use of multiple network interfaces. Note: This image has been taken from the authors’ research paper [40].
2.14 A single layer of IPC that consists of hosts with user applications and IPC subsystems. Note: This image has been taken from the authors’ research paper [51].
2.15 Internet Indirection Infrastructure (i3): An illustration of communication between two nodes where the receiver inserts a trigger and the sender sends corresponding data. Note: This image has been taken from the authors’ research paper [90].

3.1 The session abstraction involving three participants, each with two data flows instantiated by the application. The control flow is an out-of-band channel that allows setup and reconfiguration.
3.2 Contrast of endpoint and socket abstractions.
3.3 Endpoint label.
3.4 The flow abstractions and their mappings onto underlying transports in relation to time.
3.5 The structure of flows in relation to the endpoints.
3.6 Session label.
3.7 SLIM in relation to the network stack.
3.8 Session state-transition diagram.
3.9 An illustration of communication patterns.
3.10 SLIM in relation to legacy applications and those using the library.

3.11 Average throughput, with 90% confidence interval, for Socket and SLIM (1 Gbps link capacity, 0% loss).
3.12 Average throughput for Sockets and SLIM (1 Gbps link, < 1 ms RTT, 0% and 1% loss).
3.13 Flow’s perspective with SLIM and increasing number of participants (1 Gbps link, varying RTTs, 0.01% loss).
3.14 TCP’s perspective with SLIM and increasing number of participants (1 Gbps link, varying RTTs, 0.01% loss).
3.15 The long-haul (end-to-end) TCP connection is split into two independent TCP connections by the relay, each with smaller latencies.
3.16 Congestion-control feedback is coupled with TCP acknowledgments, and thus latency between peers has a significant influence on the control signaling.

3.17 Bandwidth utilization decreases with increase in latency and/or link capacity — i.e., bandwidth-delay product.
3.18 Bandwidth utilization with a single relay for varying link capacities (4 – 128 Mbps) and latency (8 – 512 ms) at a loss of 0.1%. The 95% confidence intervals are not shown here as all observations fall within ±5% of the mean.
3.19 Results for link capacity of 32 Mbps, latencies of 64 ms and 128 ms, and losses of 0%, 0.01%, 0.1%, and 1% are presented to compare long-haul and cascaded TCP with one and two relays. Zero relays imply long-haul TCP.
3.20 Cascaded TCP continues to perform well in spite of high losses.
3.21 Measured bandwidths are analyzed with respect to Mathis’ approximation of the upper limit on bandwidth when using TCP. The y-axis represents the measured bandwidth (Mbps) whereas the x-axis represents the bandwidth estimated by Mathis’ approximation [105]. Each line represents estimates at given link capacities.
3.22 Estimated and measured throughput results are presented for link capacity of 128 Mbps with losses of 0.001%, 0.01%, 0.1%, and 1%. In the model we use a loss of 0.001% to approximate 0% loss. Note that bandwidth-delay products are proportional to latencies, which are shown in this figure.

4.1 A highlight of the ratio of TCP-based traffic to the aggregate volume of traffic exiting the Virginia Tech campus on 30th March 2016. Note that UDP consumes about 10% of the share as the Chrome browser uses the QUIC protocol [77] to communicate with Google servers using UDP — which is another testament to the need for extending the network stack.

4.2 The Isolation Boundary in the context of the TCP/IP stack and the Tng layers.
4.3 Sequence diagram of the exchange of Isolation Boundary Options during connection setup.
4.4 An illustration of the transport-independent flow mapping to TCP connections.
4.5 The proposed transport-independent flow option.
4.6 An illustration of the steps involved and where they take place during a VM migration.
4.7 TCP state-transition diagram with the addition of Isolation Boundary Options (i.e., TIFID_A, TISeq_A, TIAck_A). Isolation Boundary Options are sent between the highlighted states. The arrows indicate state transitions. The transitions are labeled with actions and message types (e.g., SYN, ACK). A transition may be labeled as <command>/<message sent> or <message received>/<message sent>. For example, Send/SYN implies that a send command was received and a SYN message was sent, whereas SYN+ACK/ACK implies that a SYN+ACK message was received and an ACK message was sent. Successful delivery implies transition to where the arrow leads. TIAck are 0 when not in use (e.g., upon Active open, TIAck_B is 0). The dotted boxes indicate close commands (i.e., both passive and active).
4.8 Network configuration at the server, before and after VM migration.
4.9 TCP connection state at the server, before and after migration.
4.10 TCP connection state at the client, before and after migration. Note the listening socket on the same port is an artifact of our implementation.
4.11 Application state at the server, when logged from the client, before and after VM migration. Note that the environment variable is set at the time of connection setup and is oblivious to change in configuration.
4.12 Time for client to reconnect vs. round-trip time.

5.1 Entities hosted by middleboxes in relation to explicit interactions with endpoints. An example of SLIM’s context manager engaging the firewall service is highlighted by the dotted arrow. (The middlebox may host one or more middlebox services.)
5.2 Classification of interactions between endpoints and middleboxes.
5.3 A partial representation of firewall-related concepts. (We state partial because a source may also be listed as a network interface and not just an endpoint. Doing so encompasses all traffic flowing from the said interface.)
5.4 An example of explicit interaction between endpoints and middleboxes.

5.5 Firewall state machine.
5.6 Our classification of faults.
5.7 State diagram illustrating fault detection, mitigation, and recovery.
5.8 Master and worker configuration between groups of processes using MPI intercommunicators.

5.9 Open MPI architecture (recreated from [182]).
5.10 Incremental deployment and integration with the Open MPI Byte Transfer Layer (BTL).
5.11 SLIM in relation to legacy applications and those using the library.

5.12 Trace of average latency for BTL+TCP (Socket API) vs. BTL+SLIM+TCP (SLIM) using unprimed long-running microbenchmarks (1 Gbps link capacity, 0% loss).

5.13 Trace of average throughput for BTL+TCP (Socket API) vs. BTL+SLIM+TCP (SLIM) using long-running microbenchmarks (1 Gbps link capacity, 0% loss).

List of Tables

2.1 Comparison of select state of the art vs. our contributions (✓: does/can support, ✗: does not support, ✗/✓: subjective).

3.1 Setup time between peers.
3.2 Response time of non-blocking reconfigurations.
3.3 Session’s memory footprint.
3.4 Values used to configure Dummynet and emulate testbed.
3.5 PlanetLab paths tested as part of a case study.
3.6 Findings from a select case study.
3.7 Predicted and measured throughput (link capacity of 8 Mbps).

Chapter 1

Introduction

Computer networks have made tremendous progress since their inception. The kingpin has been the transmission control protocol (TCP) [1], which supports a large fraction of communication between network applications¹. We may argue that it is the simplicity and efficacy of the protocol that has resulted in its widespread adoption, thus enabling the proliferation of the Internet.

With such widespread access, users now have increased expectations. For example, we find users switching between different devices while using the same application. They now expect to remain connected to the network at all times, whether it is via their hand-held devices or desktops and whether they are within reach of one network or moving between networks.

There is an unqualified expectation to utilize all available resources for performance and/or reliability. Furthermore, communication hardware has evolved substantially over the years. For example, hand-held devices are ubiquitous; most have multiple network interfaces, and the network infrastructure includes middleboxes to facilitate desirable functionality. Networking software, on the other hand, has not maintained pace with the evolution of hardware. A few years ago, network access was limited to web browsing and simple file transfers. However, now the demands have evolved to an extent which TCP was never designed to support [3].

Because the necessary functionalities for modern applications are not provided by the network stacks, developers are forced to implement them, over and over again, as part of the application or supporting libraries — e.g., support for seamless handoff in Android OS [4] and Apple Continuity [5]. (The relationship of such libraries with the network stack is illustrated in Figure 1.1.) Hence, application developers not only bear the burden of developing application features but are also responsible for building networking libraries to support sophisticated scenarios (e.g., WiFi to cellular network handoff [4]). This leads to considerable duplication of effort.

¹ An example that illustrates this is the network traffic at Virginia Tech, where TCP delivers over 2 TB per day over the wireless network (Aug 2014). It is widely accepted that about 90% of the packets flowing through the networks have TCP payloads [2].

1.1 Motivation

[Figure 1.1: An illustration of typical network stack configurations, highlighting some design choices for implementing supporting network libraries. Panels: (a) libraries built in user space, as part of the application; (b) libraries built in user space, independent of the application; (c) libraries built in kernel space, independent of the application; (d) libraries built in user and kernel space, independent of the application.]

As Jennifer Rexford said, “the Internet is showing signs of age” [6]; there is a dire need for greater functionality in the communication stack. The challenge for TCP in supporting modern use cases is mostly due to limiting assumptions, simplistic communication abstractions, and (once expedient) implementation shortcuts. For example: 1) the assumption that network addresses do not change during communication (i.e., the duration of a connection is expected to be shorter than the lifetime of the network address assignment) led to identifying network connections by network addresses, which makes mobility a challenge; 2) the expectation that an application would typically need to communicate over a single network impedes simultaneous use of multiple network interfaces; and 3) the expectation that the protocols to be used by the network stack are determined beforehand precludes dynamic stack configuration. To further add to the complexity, the limited TCP options space is insufficient to support extensibility and thus contemporary communication patterns [7]. Hence, some of the challenges that developers face in implementing modern use cases are:

Naming Transport flows are labeled using the 5-tuple (i.e., protocol, source and destination network addresses, and source and destination ports), resulting in a coupling of the network address and the flow label. Consequently, we are handcuffed to one network interface for the lifetime of the transport connection, and we cannot maximize resource utilization even if the end host has multiple network interfaces and is potentially connected to different networks.
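The 5-tuple is directly observable from any connected socket. The sketch below (ours, for illustration) reads it back via the standard sockets API; the kernel demultiplexes incoming segments using exactly these values, which is why a change of local address breaks the connection.

```python
import socket

def five_tuple(sock: socket.socket):
    """Return the 5-tuple that labels a connected TCP flow.

    The kernel demultiplexes incoming segments using exactly these
    values, which is why a change of local address (e.g., on a WiFi
    to cellular handoff) invalidates the flow label.
    """
    src_ip, src_port = sock.getsockname()[:2]
    dst_ip, dst_port = sock.getpeername()[:2]
    return ("TCP", src_ip, src_port, dst_ip, dst_port)
```

On a connected socket pair, the client's tuple mirrors the server's, with source and destination swapped.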

Mobility The transport connections are labeled using the network addresses. If the device moves from one network to another, the network interface address would change and so would the transport label. This change results in a disruption in communication, as the network stack sees this as a disconnection.

Flow Abstraction Application flows are thought to be transport flows. This is a result of the use of the socket abstraction, where each (logical) application flow is directly mapped onto a transport flow and the application essentially operates on an endpoint abstraction [8]. Without decoupling the two, the mapping of an application flow onto a single transport flow is static and limited. Consequently, features such as those below must be implemented by application developers rather than be provided as part of the networking stack:

Multipath Transport With a flow abstraction between the application and the transport mechanisms, we may have the application flow mapped onto the flow abstraction and the flow abstraction mapped onto transport flow(s). This introduces the possibility of having more than one transport flow, potentially taking different paths through the network.

Multihoming Separating the flow abstraction from transport allows application flows

to be mapped onto more than one transport maintained over different network

interfaces.

Flow Migration Without an indirection (i.e., a flow abstraction), it is not possible to seamlessly migrate the mapping of an application flow from one transport to another.

Hybrid Transport Without the flow abstraction, we cannot imagine the application flow being mapped onto a construct of longitudinally hybrid transports (e.g., a composition of packet- and circuit-switched transports to form the transport flow).

Transport Independent Flows Without a flow abstraction we preclude the possibility

of having a plugin architecture where appropriate transport implementations may

be dynamically used and/or swapped for better alternatives.

Session Management Legacy abstractions assume that network communication begins with

the instantiation of transport connections and ends with their termination. This is a

conflation of session and transport semantics. If processes were to migrate between hosts,

the communication session continues; however, the existing transport connections may

become invalid and new connections would be required.

It is important to distinguish between the session and transport semantics and implement them in their respective layers. This allows the underlying layers to provide the

necessary and efficient supporting mechanisms so application developers do not have to

implement them in their applications or libraries where information about the state of

communication held by the operating system is lost.

Process Migration Without explicit session management, migrating a process from one

end host to another in a manner that the processes’ communication is not disrupted

is not possible.

Two or More Participants Without explicit session management, application developers are burdened with the responsibility of maintaining session state as part of the application. This not only results in duplication of effort, but also increases complexity as the number of participants in a session increases. It is for this reason that we see supporting libraries such as Apple Continuity [5], which are implemented as application substrates. Thus, it becomes a challenge to implement two or more endpoints participating in the same conversation, and communications are forced towards the two-participant model.

Other Contemporary Use Cases With explicit session management, we open multiple avenues for innovation and extensibility. Without it, developers are burdened with the bookkeeping necessary to manage use cases such as: 1) applications establishing multiple streams between hosts; 2) an existing transport connection being leveraged to instantiate new connections — the 3-way handshake becomes redundant for subsequent connections because the session has already incurred the cost of bootstrapping; and 3) congestion or flow control window sizes derived from existing connections without violating fairness constraints, thus avoiding the slow-start phase.
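A sketch of this separation (our illustration; the class and method names are hypothetical, not the dissertation's API): the session carries the stable conversation identity and pays bootstrapping costs once, while any number of transport streams are added to it later without repeating that work.

```python
import os

class Session:
    """Sketch of a session abstraction decoupled from transport (the
    names are ours, for illustration). The session identity survives
    transport churn; bootstrapping costs such as authentication are
    paid once per session rather than once per connection."""

    def __init__(self):
        self.session_id = os.urandom(8).hex()  # stable conversation label
        self.authenticated = False
        self.streams = []  # transport connections belonging to the session

    def bootstrap(self, credentials) -> None:
        # Placeholder for the expensive once-per-session work
        # (handshake, authentication, key exchange).
        self.authenticated = bool(credentials)

    def open_stream(self, connect):
        """Add a transport to an already-bootstrapped session; later
        streams skip re-authentication. `connect` is a callable that
        returns a connected transport endpoint."""
        if not self.authenticated:
            raise RuntimeError("bootstrap the session first")
        self.streams.append(connect())
        return self.streams[-1]
```

The second and subsequent calls to `open_stream` reuse the session's bootstrapped state, which is exactly the redundancy the text identifies in per-connection handshakes.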

Cross Layer Communication Although each layer in the network stack possesses valuable information (e.g., link state at the link layer), the layers do not make such information accessible to other layers. Having access to such information may enable inferring context (e.g., existence of a path to destination).

Middleboxes as First-Class Citizens The end-to-end communication model, upon which TCP is based, has indeed proved to be a tremendous success. Nevertheless, excluding middleboxes from explicit participation in communications makes the work of middleboxes more challenging and limits the services they can provide. This prevents middleboxes, such as firewalls, accelerators, and load balancers, from offering much richer services without disrupting communication.

Dynamic Stack Configuration As the configuration of the communication stack is assumed

to be static and determined before the conversation starts, the possibility of a pluggable

architecture is precluded. For example, it would not be possible to replace the current

congestion control module for an ongoing TCP connection with an alternate (e.g., XCP)

at run time.
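Linux does expose one narrow instance of runtime reconfiguration: the congestion-control module of an individual TCP socket can be swapped via the `TCP_CONGESTION` socket option (the available algorithms depend on which kernel modules are loaded). The sketch below only hints at the broader pluggable architecture envisioned here; it is Linux-specific and raises `OSError` for algorithms the kernel does not offer.

```python
import socket

def set_congestion_control(sock: socket.socket, algo: str) -> str:
    """Swap the congestion-control module for one TCP socket
    (Linux-only). Raises OSError if `algo` is unavailable; see
    /proc/sys/net/ipv4/tcp_available_congestion_control for choices."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, algo.encode())
    # Read the option back to confirm the kernel accepted the change.
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    return raw.split(b"\x00", 1)[0].decode()
```

This per-socket knob is far from a general pluggable stack, but it shows that dynamic reconfiguration of a live connection is feasible at the kernel boundary.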

The network stack, and particularly TCP, neither supports such functionality, nor does modifying it to support new functionality in a backwards-compatible manner appear viable [9–12]. Some argue that radical changes are required to extend the network’s functionality; some researchers believe that a clean-slate approach is the only path forward [13–16]. Others suggest that the evolution of the network stack is necessary to ensure wider adoption — by avoiding a flag day [7,17].

As we explain in chapter 2, the need for greater functionality in the communication stack, in particular TCP, has been reiterated in recent research [7, 9–12, 17–65]. However, we see that the proposed solutions have not been adopted by the community at large. This is perhaps because the cost of transition from the incumbent technology to the new technology outweighs the value offered [66]. In some cases, the value offered by a proposal does not justify the transition cost, due to the limited scope of the proposed solution. In other cases, the lack of backward compatibility or significant porting effort precludes incremental adoption altogether.

It appears that the foremost reason for limited adoption is the resistance to move away from TCP, due to its widespread deployment. Therefore, it is fair to conclude that TCP’s success has in itself become an impediment to its evolution [7,63,67]; the legacy behind communication stacks, in particular TCP, is such that it does not allow for further evolution towards next-generation networks. TCP is largely viewed as good enough. Users as well as network operators do not want a change which does not look or feel like TCP. However, change is necessary if we are to incorporate increased functionality. The key to introducing change would therefore be to admit incremental adoption, thereby facilitating a smooth evolution of network communication, while minimizing duplication of effort.

1.2 Problem Statement

After studying the state of the art and understanding the design and implementations of modern use cases, we observe that there are common aspects which serve as fundamental building blocks in enabling contemporary communications. These include the need for:

1. A notion of communication contexts, which illustrates how the communication stack

should be able to discover, interact with and use available resources to compose richer

communication constructs,

2. Communication abstractions that provide means to the application developers for setting

up and manipulating communication constructs,

3. The ability to recognize change in operating context of communications and reconfigure

the constructs to adapt to the requirements, and

4. A communication model that considers two or more participants to be involved in the

conversation — which is in contrast with the well-established two-party model [68].

Thus, in the effort to meet the ever-increasing need for evolution of network communication

software, we are led towards the question:

How do we design, implement and evaluate a communication model that describes abstractions enabling extensions to the network stack, benefits from the use of context, allows for adaptation to manage changes in the context, and considers the possibility of two or more participants involved in a conversation?

In an attempt to answer this question, we leverage the lessons learned from the state of the

art and propose a fundamental building block, a session layer, which serves as a backwards-compatible extension to the current TCP/IP stack, while servicing modern use cases such as those mentioned above. As illustrated in Figure 1.2, the session layer services applications while

interacting with the network stack on the application’s behalf. This layer includes some of the

common features that developers are forced to implement over and over again, due to lack

of support in the legacy stack — e.g., fault tolerance and resilient connectivity. In addition,

the framework is designed in a manner that allows for future extensions. We discuss design

choices of the session-layer implementation in chapter 3. The research contributions that came

as a consequence of developing the session layer are discussed in the following section.


Figure 1.2: A simplified illustration of relationships between the application and session layer, along with the rest of the stack. The session layer aims to implement some of the common features that developers implement as supporting libraries or look for in third-party implementations.

1.3 Research Contributions

In this dissertation, we focus on the development of a communication model that explicitly

acknowledges context and describes (many) modern and current communications. It does so

by using a session abstraction to describe communications between two or more endpoints2.

Our contributions also include an implementation of a framework that realizes such communication methods, enabling future innovation. We substantiate our claims by demonstrating case studies where we use the aspects of the session abstraction to highlight the gains.

2 We use the term endpoints and participants interchangeably. We explain how they map to processes in chapter 3.

The research contributions of this dissertation are:

Model:

Session-Based Communication We develop a model to describe modern communications involving two or more participants. We do so by defining endpoint, flow, and session abstractions, their primitives, and their interactions with each other and the network stack [69].

Separation of Session and Transport Semantics We also identify and discuss the separation of session and transport semantics, as well as the implications of such a separation on communication patterns [69].

Context Awareness We present a case study [70, 71], which concludes that context awareness can improve end-to-end throughput over long-distance communication. This is to substantiate our claim that context awareness opens the door to a multitude of benefits.

Realization:

Enabling Extensions We propose and implement a solution, which we refer to as ISOTCP,

to extend the network stack, particularly TCP [72].

Backwards-Compatible Extensions We demonstrate how such extensions reduce duplication of effort while being backwards-compatible and enabling incremental adoption [73,74]. In doing so, we demonstrate a case study of virtual machine migration beyond a subnet while maintaining network connectivity. We also study the impact of middleboxes while extending TCP to be fault tolerant in the face of disconnections.

Use:

Enabling New Communication Paradigms We show how our proposed extensions enable new communication paradigms, by virtue of the ability to dynamically configure communications and interact with multiple participants as part of a single communication session. We show how explicit interaction with middleboxes, in particular firewalls, can enable robust communications [75]. We also show how the proposed extensions may be applied in specialized domains [76].

Interactions with Middleboxes In doing so, we define the mechanisms by which network stacks would interact with middleboxes, and the general characteristics of a protocol that enables dynamic configuration and reconfiguration of communications.

The contributions listed above are summarized in Figure 1.3 below.

1.4 Outline

This thesis is organized as follows. In chapter 2, we present a survey of the state of the art and

discuss related efforts. We present the communication model, the abstractions, their design

and implications for existing communication patterns, their implementation and related case

studies in chapter 3. In chapter 4, we present how the proposed extensions may be incorporated into the existing TCP stack in a backwards-compatible manner. We also demonstrate the use of such extensions in real-world networks (i.e., in the presence of middleboxes). We demonstrate resilient transport and virtual machine migration as case studies of network stack extensions. We then present the possibilities of enabling new communication paradigms in chapter 5. Finally, we conclude in chapter 6 by identifying relevant directions of future research.

Figure 1.3: Summary of PhD dissertation research contributions. (The figure charts the session-based communication model of chapter 3, its realization as backwards-compatible extensions in chapter 4, and the new communication paradigms of chapter 5, along with the associated publications.)

Chapter 2

Related Work

There have been several notable attempts in the past to extend the network stack and provide richer and more sophisticated services to applications, thus enabling modern use cases [9–12,18–24,26–37,40–46,49,51,53,54,56,57,65,77]. We broadly classify them into the following groups: 1) session-layer proposals, 2) modern transport-layer proposals, 3) network stack extensions, and 4) clean-slate designs.

2.1 Session-Layer Proposals

The proposals discussed below directly or indirectly suggest the use of session semantics. To achieve this, they present abstractions that enable richer functionality. In some cases, however, despite the richer functionality, the session semantics remain implicit and limited to the scope of transport connections. Nevertheless, their use allows developers to build the supported communication features into their applications to enable modern use cases.


2.1.1 TESLA - A Transparent, Extensible Session-Layer Architecture for End-to-End Network Services

TESLA, a transparent, extensible session-layer architecture for end-to-end network services, proposed in 2003, presents notable session layer services that are based on a flow abstraction [12]. The authors propose the use of flow handlers to implement higher-layer and end-to-end services such as connection multiplexing, congestion state sharing, application-level routing, and mobility management. Features such as these improve the robustness as well as the performance of network applications.

TESLA builds an abstraction of session layer services allowing the network application to operate with network flows, presented as objects, instead of calling functions on the sockets API. Also, the solution is implemented as a shim layer which runs in user space and traps network operations – this is realized by employing methods of dynamic library interposition (e.g., LD_PRELOAD [78]). The implementation is done by writing event handlers with a callback-oriented interface between handlers. This method also allows programmers to implement features to add functionality.
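The flow-handler pipeline can be sketched as follows. This is our reconstruction for illustration, not the authors' code: each handler consumes one input flow and emits to zero or more downstream handlers, so services compose by chaining (XOR stands in for a real cipher).

```python
class FlowHandler:
    """Our reconstruction of TESLA's flow-handler idea (not the
    authors' code): each handler takes one input flow and writes to
    zero or more downstream handlers through callbacks."""

    def __init__(self, downstream=None):
        self.downstream = downstream

    def write(self, data: bytes):
        """Process data from the upstream flow; default is pass-through."""
        self.emit(data)

    def emit(self, data: bytes):
        if self.downstream is not None:
            self.downstream.write(data)

class XorCipher(FlowHandler):
    """Stand-in 'presentation layer' handler; XOR is illustrative
    only, not a real cipher."""

    def __init__(self, key: int, downstream=None):
        super().__init__(downstream)
        self.key = key

    def write(self, data: bytes):
        self.emit(bytes(b ^ self.key for b in data))

class TransportSink(FlowHandler):
    """Terminal handler standing in for the actual transport flow."""

    def __init__(self):
        super().__init__()
        self.buf = bytearray()

    def write(self, data: bytes):
        self.buf += data
```

Chaining `XorCipher(0x5A, TransportSink())` mirrors the left stack of Figure 2.1: application writes are trapped, transformed by the handler, and forwarded to the transport flow.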

In Figure 2.1 we see two sample scenarios. The stack on the left illustrates the case where the TESLA library supports presentation layer semantics such as encryption. As mentioned above, these presentation layer semantics may be implemented by programmers to extend the functionality of the stack. The TESLA stack assists in trapping the calls and triggering the registered handlers, which in turn write to the next flow, which in this case is the actual transport flow. Similarly, the stack on the right illustrates the case where the TESLA library is not only supporting presentation layer semantics but is also enabling access to multiple network interfaces.

The TESLA library maps the incoming logical flow to one transport flow using, say, network 1; however, in case of a disconnection, the library seamlessly migrates the mapping to a different transport flow, say using network 2.

Figure 2.1: Two TESLA stacks are shown to illustrate how session-layer services may be put together. The stack on the left supports presentation layer semantics, while the other supports disconnection management. Note: This image has been taken from the authors’ research paper [12].

The authors argue that implementing the TESLA library in the user space is preferable, as this eases deployment and, consequently, adoption by the users. Performance evaluation of the implementation, and its comparison to session semantics implemented as part of the application, shows that TESLA does incur some small overhead when it comes to achievable throughput. However, in the case of latency, TESLA performs at par with application-based implementations of session semantics.

Although TESLA defines a flow abstraction and looks like TCP on the wire, it was not widely adopted. This may be due to the following reasons: 1) it is not backwards compatible with legacy applications, unless a stub is used for interposition, which inhibits deployment; 2) it assumes that all network stacks implement TESLA as a session layer service and therefore it

is not backwards compatible and incrementally adoptable with existing network stacks; 3) it

does not cater to the need for session semantics and therefore does not enable a representation

of the conversation between processes, instead the primary focus is on enabling extensions

to transport services (e.g., encryption of flows); and 4) it does not support communication

between more than two processes.

2.1.2 Session Layer Concept for Overlay Networks

Mahieu et al. argue that overlay network solutions expect mobility solutions to maintain open connections during hand-overs, and that developers are burdened with the responsibility of implementing session management solutions as part of the application [19]. Also, these implementations typically have certain constraints which do not help in implementing mobility

semantics. With these constraints in view, the authors present four properties which every

mobility management solution should incorporate. These properties are:

a) Mobility events must not be hidden from higher layers;
b) Mobility solutions must have an API to allow interaction with the stack and allow applications to receive feedback as per the subscribe-publish model;
c) Solutions must not be limited to certain transport protocols; and
d) Mobility solutions should be able to cope with heterogeneity of networks.

The authors propose two sub-systems: the connection abstraction system and the address management system. The connection abstraction system develops the concept of sessions for the application and meets the properties that a mobility management system should possess. It

implements session semantics by exchanging session management information with its peer using session headers, as shown in Figure 2.2.

Figure 2.2: The session layer header used to communicate control information between communicating peers. Note: This image has been taken from the authors’ research paper [19].

The figure highlights the use of a session ID to identify sessions among the multiple instances used by the application, along with various control flags. These session IDs are considered to be abbreviations of the Universally Unique Identifiers (UUIDs) used by operating systems.
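A header of this kind can be sketched with fixed-width fields; the layout below is our illustration (the field widths and the marker value are assumptions, not the authors' exact wire format):

```python
import struct

# An illustrative session-header layout loosely modeled on the CAS
# header (marker, session ID, control flags). The field widths and
# the marker value are our assumptions, not the authors' wire format.
HEADER = struct.Struct("!4s Q H")  # marker, 64-bit session ID, 16-bit flags
MARKER = b"SESS"

def pack_header(session_id: int, flags: int) -> bytes:
    return HEADER.pack(MARKER, session_id, flags)

def unpack_header(buf: bytes):
    marker, session_id, flags = HEADER.unpack_from(buf)
    if marker != MARKER:
        raise ValueError("session-header marker not found")
    return session_id, flags
```

The marker plays the role described in the paper: it lets the receiver locate the header within a byte stream when a stream-based transport is in use.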

The connection abstraction subsystem is also responsible for handling network failures (i.e., migration of transport flows to different networks in case of failures). It does so by defining a notion of logical flows which are mapped onto the transport flow. These migrations are realized by defining multiple states — such as connected, suspended, resuming, and terminated — in which the session may exist. This is akin to the transport layer TCP states. An important consideration is the management of unacknowledged data which may be present in the transport connection’s buffer when a disconnection occurs. The migration process is also responsible for recovering these bytes and transferring ownership to the new transport flow. This also implies the implementation of checkpointing semantics. The connection abstraction system supports streaming as well as datagram services. Note that, as highlighted in the required properties, the connection abstraction system provides feedback to the application by implementing the publish-subscribe model for notifications.
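The session life cycle above can be captured as a small state machine. The sketch below is a simplification for illustration; the paper's CAS state machine has additional states (e.g., for its three-way handshakes), and the names here are ours:

```python
from enum import Enum

class SessionState(Enum):
    CONNECTED = "connected"
    SUSPENDED = "suspended"
    RESUMING = "resuming"
    TERMINATED = "terminated"

# Legal transitions approximating the life cycle described above; the
# actual CAS state machine includes further handshake states.
TRANSITIONS = {
    SessionState.CONNECTED: {SessionState.SUSPENDED, SessionState.TERMINATED},
    SessionState.SUSPENDED: {SessionState.RESUMING, SessionState.TERMINATED},
    SessionState.RESUMING: {SessionState.CONNECTED, SessionState.SUSPENDED},
    SessionState.TERMINATED: set(),
}

def transition(state: SessionState, nxt: SessionState) -> SessionState:
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state.value} -> {nxt.value}")
    return nxt
```

A handover is then the sequence suspend, resume, connected: the session persists across these states while the underlying transport is replaced.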

On the other hand, the address management subsystem enables application access to address information facilitating migration and disconnection management.

Although this proposal raises the right questions, the entire discussion of session semantics is within the context of end-point mobility. Session management of conversations spans a broader landscape. For example, today the authentication of SSL connections is done independently for each transport connection. This is in spite of the fact that web servers (e.g., the Apache Web Server) implement the feature of maintaining session state; however, system administrators choose not to enable this feature — due to the cost of maintaining session state as part of the application. It is obvious that once a process on an end point has been authenticated, any subsequent transport connection should not require explicit authentication. Thus, considering session semantics within the scope of mobility alone may not be the recommended approach.

Like TESLA, this proposal was not widely adopted, perhaps because: 1) the proposed design is not backwards compatible with legacy applications, unless a stub is used for interposition, which inhibits deployment; 2) similarly, it assumes that all network stacks implement the session layer service and therefore it is not backwards compatible and incrementally adoptable with existing network stacks; 3) the session semantics discussed are limited in scope to mobility concerns alone and thus do not enable a broader representation of the session semantics between processes; and 4) it does not support communication between more than two processes.

2.1.3 A Session-Based Architecture for Internet Mobility

Snoeren et al. propose the use of a session layer to deal with the challenges of mobility in contemporary networks [20–22]. They identify the fundamental issues that concern mobility as: 1) host or service location, 2) preserving communication, 3) disconnection, 4) hibernation, and 5) reconnection. To address the challenges of mobile networking, they propose four guidelines:

a) Eliminate the dependence of higher protocol layers upon lower-layer identifiers;
b) Avoid prescribing a particular naming scheme;
c) Handle unexpected network disconnections in a graceful way, exposing occurrences to applications; and
d) Provide these services at the mobile nodes themselves.

While the proposed research develops the design of a session layer, which supports fault tolerance, disconnection, reconnection, and state and context management, these services are all in the context of mobility. As we discuss in Section 2.1.2, session management that supports

a reasonable set of modern use cases covers a broader landscape. To meet the needs of mod-

ern communication, session semantics must not only consider mobility issues, but also lead

towards the development of a general communication model that for example considers com-

munication between multiple participants. As with the proposals listed above, this research

was perhaps not widely adopted (in the context of session management) because of its limited

scope; the value offered by the proposal was perhaps not sufficient to encourage the wider

public to consider a transition.

2.1.4 Phoebus: A Session Protocol for Dynamic and Heterogeneous Networks

Brown et al. present a session protocol that enables setup of a longitudinally hybrid construct with different transport mechanisms strung together [23, 24]. The scope of this research is limited to high-performance, long-distance networks. Instead of using legacy transport connections end-to-end, Phoebus allows setup of TCP connections to Phoebus gateways (i.e., middleboxes), which use specific transport protocols (in this case, a circuit-switched network protocol). An illustration of such a setup is shown in Figure 2.3. The proposal for Phoebus gateways allows segment-specific transport protocols to hide the details of resource allocation and use.

Figure 2.3: An illustration of the TCP model vs. the Phoebus model. Note: This image has been taken from the authors' research paper [23].

While the proposed solutions have been tested in real-world environments (i.e., Internet2), the scope of the research is limited to the specific challenge of maximizing throughput for large data transfers over National Research and Education Networks (NRENs). The intention is to use available fiber networks for long-haul transfers by deploying the Phoebus gateways on the last-mile segments. This minimizes the influence of TCP, which is susceptible to poor average throughput over wide-area networks.

As with the proposal by Mahieu [19], this proposal was not widely adopted, perhaps because: 1) NRENs are not accessible to the wider public, which means access to Phoebus gateways is not possible; 2) the goals of this research were not to develop session semantics for network communication in general, and thus the appeal to the wider audience was minimal; and 3) since the proposal addresses a specific challenge, broader session semantics (e.g., involvement of two or more participants in the communication) were not developed.
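The split-transport idea behind Phoebus can be illustrated with a small conceptual sketch. This is our illustration, not the Phoebus implementation; the segment and transport names are hypothetical. Each path segment carries data over its own transport, with gateways bridging the segments:

```python
# Conceptual sketch (not the Phoebus implementation) of a split-transport
# path: rather than one end-to-end TCP connection, the path is broken at
# gateways, and each segment uses a transport suited to it (e.g., TCP on
# the last mile, a circuit-switched protocol between gateways).
class Segment:
    def __init__(self, name, transport):
        self.name = name
        self.transport = transport  # e.g., "tcp" or "circuit"

    def carry(self, data):
        # A real gateway would re-frame the data for this segment's
        # transport; here we only record the hop for illustration.
        return data, f"{self.name} via {self.transport}"

def relay(path, data):
    """Forward data hop by hop across heterogeneous segments."""
    hops = []
    for seg in path:
        data, hop = seg.carry(data)
        hops.append(hop)
    return data, hops

path = [
    Segment("host A -> gateway 1", "tcp"),        # legacy last mile
    Segment("gateway 1 -> gateway 2", "circuit"),  # specialized core
    Segment("gateway 2 -> host B", "tcp"),         # legacy last mile
]
payload, hops = relay(path, b"bulk data")
print(hops)
```

The sketch highlights the design choice the dissertation describes: end hosts keep speaking legacy TCP on the last mile, while the gateways are free to use a segment-specific transport in the core.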


2.1.5 Taking Advantage of Multi-homing with Session Layer Striping

In this research, Habib et al. quote various sources to establish that multihoming does indeed improve the aggregate throughput achieved by the user application [11]. (In doing so, they acknowledge that last-mile issues do have their influence despite support for multihoming.)

They claim that the benefits of freeing the application developer from implementing striping using multihoming semantics are far greater.

Figure 2.4: Session layer striping architecture, where we have k sessions having the ability to access n network interfaces. Note: This image has been taken from the authors' research paper [11].

This method of multihoming is made possible by developing the notion of a virtual flow, which in turn is striped onto multiple transport flows, which in turn preferably use different networks. This enables the session layer semantics to be independent of the underlying transport flows.

Here the authors present an interesting take on the subject: striping may also be implemented as part of the transport-layer semantics, and SCTP is one method which may facilitate doing so. However, this ties the solution to a specific transport.

Another argument is that since the motivation behind developing striping is to improve performance, it is not far-fetched to have applications define expectations for the duration of the conversation. These expectations can facilitate the session layer in defining connection priorities for the use of transports. A drawback of the multihoming method, also highlighted by the authors, is the issue of head-of-line blocking when using striping. (A simple solution to address repeated occurrences of head-of-line blocking is to identify and avoid the use of the network path that causes such hindrances.)

Figure 2.4 illustrates how the implementation enables support for setting up at least k × n connections, where we have k session instances and n network interfaces.

The authors argue in favor of user-space implementations for ease of deployment and adoption by users as opposed to kernel-space implementations.
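The virtual-flow idea lends itself to a minimal user-space sketch. The following is our illustration with hypothetical names, not the authors' implementation; transport flows are modeled as simple queues rather than real connections. Sequenced chunks are striped round-robin over the flows and reassembled in order at the receiver:

```python
# Minimal sketch of session-layer striping: a virtual flow is split into
# sequenced chunks, striped round-robin across several transport flows
# (modeled here as FIFO queues), and reassembled in order.
from collections import deque

class VirtualFlow:
    """Stripes application data across n transport flows, round-robin."""
    def __init__(self, n_flows):
        self.flows = [deque() for _ in range(n_flows)]
        self.seq = 0

    def send(self, data, chunk_size=4):
        # Tag each chunk with a sequence number so the receiver can
        # restore ordering regardless of which flow delivers it first.
        for i in range(0, len(data), chunk_size):
            flow = self.flows[self.seq % len(self.flows)]
            flow.append((self.seq, data[i:i + chunk_size]))
            self.seq += 1

def reassemble(flows):
    """Gather chunks from all flows and rebuild the stream in order."""
    chunks = [item for flow in flows for item in flow]
    return b"".join(payload for _, payload in sorted(chunks))

vf = VirtualFlow(n_flows=3)
vf.send(b"session layer striping demo")
print(reassemble(vf.flows))  # b'session layer striping demo'
```

Because ordering is restored from the per-chunk sequence numbers, the session layer's semantics remain independent of which underlying transport flow carried a given chunk, which is the property the authors emphasize.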

The primary focus of this proposal is to improve throughput for transporting volumes of data across the network. The intention is to maximize the utilization of resources (i.e., available network interfaces). Although maximizing resource utilization is a much-desired goal, in the context of session management this proposal does not cover the broader spectrum of developing session semantics. Perhaps the reasons for its lack of wider deployment are: 1) the proposed design assumes that all network stacks implement the session-layer striping service, and therefore it is not backwards compatible and incrementally adoptable with existing network stacks; 2) the goals discussed are limited in scope to improving throughput alone and thus do not consider a broader discussion of the session semantics between processes; and 3) it does not support communication between more than two processes.

2.1.6 Open Systems Interconnection (OSI) Model

In theory, the OSI model proposed a session layer to service the presentation and, in turn, the application layer [79]. The intention was to provide synchronization services between two presentation layers, which later included services such as temporary management of conversation state, authentication, authorization, and session restoration. The original network stack diagram is shown in Figure 2.5.

Figure 2.5: OSI Reference Model: the seven layers of the OSI architecture for Open Systems Interconnection. Note: This image has been taken from the authors' research paper [79].

Perhaps the OSI model was not widely adopted because of its slow development, and also because TCP/IP, at the time, was proving to be an effective alternative [80]. However, we will avoid speculation on the subject.

Although the general opinion is that the recommendations of the OSI model have been merely academic, the notion of a session layer and of maintaining state to assist with the semantics of a dialogue are significant contributions. However, with contemporary use cases, the recommendations of the OSI model would prove insufficient. Consider, for example, the implementation of state management developed for Apple Continuity [5]. The library implementation allows multiple devices/end-points to be part of the same session; use cases such as these highlight the dire need for a communication model that considers two or more participants. The OSI model, unfortunately, does not cater to such requirements.

2.1.7 Session-Initiation Protocol

The Session-Initiation Protocol (SIP) [81] is one of the well-known session-oriented communication protocols. As the name suggests, SIP is a protocol that assists with the setup and tear down of a session. However, it does not engage during the conversation. Although SIP does not present itself as an extension to the networking stack, it is relevant to our research since we adopt a session-based communication model, and it behooves us to learn from the experiences of SIP and its implementation. SIP is primarily geared towards multimedia services and is the cornerstone of most VoIP services. It handles the call management, while the voice and video data is transferred using RTP [82]. An illustration of a call setup using SIP is shown in Figure 2.6.

Figure 2.6: An illustration of the SIP session setup with the SIP trapezoid. Note: This image has been taken from RFC 3261 [81].

Although SIP has seen tremendous success with call services, it has not been adopted for use

outside of multimedia. This may be because SIP does not present constructs or abstractions that describe the communication, and instead focuses on call management. SIP leaves the decisions of describing the communication session, the available resources, their configuration, and the communication protocol to the application developer. While this gives developers complete liberty in how to implement communications, we are essentially back to square one, as with the legacy network stack, where there is limited support for modern use cases, or constraints are introduced due to the limiting assumptions of the legacy stack implementations. In other words, SIP simply builds on top of existing stacks and focuses on the setup and tear down of sessions. Thus, services built using SIP are not only constrained by the limitations of the existing stacks, but also incur additional overheads for working around these limitations. Also, SIP is dependent on the application developer to monitor the session state and assist with reconfiguring or resetting communication. Although SIP has not seen adoption outside of multimedia (perhaps due to the challenges listed above), it presents useful insights into session management. We try to leverage the lessons learned by SIP and consider them while conducting our research.

2.2 Transport-Layer Proposals

The proposals discussed below suggest extensions to the transport layer in the network stack to enable richer functionality. As discussed earlier, the general motivation is to enable modern use cases by supporting features that are not provided by the legacy network stack.

2.2.1 Structured Stream Transport

Structured Streams [18], proposed in 2007, presents a transport abstraction as an alternative to TCP. An illustration of this abstraction is shown in Figure 2.7. The intent is to create associated child transport streams from existing transport connections while incurring minimal cost; this results in a hierarchical, hereditary structure. It allows the application to have parallel streams. Each stream is capable of independent data transfers and therefore avoids issues such as head-of-line blocking. The ability to create independent child streams enables multiplexing onto communication channels, which the application can use as a logical stream, or it may pack data into messages with defined boundaries.

Figure 3: SST Packet Layout

sequence numbers work like those IPsec uses for replay pro- tection in the AH and ESP [30]. While IPsec “hides” its sequencing semantics from upper layers in the interest of operating invisibly to existing transports, one of SST’s de- sign insights is that this sequencing model provides a useful Figure 2.7: An illustrationFigure of 2:the structuredSST Communication streams abstraction. Note: Abstractions This image has been takenbuilding from block for new, richer transport services. the authors’ research paper [18]. The following sections detail the channel protocol’s iden- connection instance or channel,andtoassignmonotonically tification, sequencing, security, acknowledgment, and con- In spite of itsincreasing novel contributions,packet sequence the proposed numbers reliableto stream all packets abstraction, transmit- has not been widelygestion control services. While one packet submitted to ted within a particular channel. The channel protocol also the channel protocol currently translates to one packet in adopted, perhaps because: 1) it is not backwards compatible with applications or legacy stacks, attaches acknowledgment information to packets to deter- the underlying protocol (e.g., IP or UDP), the channel pro- and also thatmine middleboxes when packets do not tend have to let arrived traffic throughsuccessfully,[67], which and usesdoes not this look liketocol TCP could be extended to provide services such as chunk acknowledgment information internally to implement con- bundling [49] or quality-of-service enhancement [50]. gestion control at channel granularity. Finally, the channel protocol protects each packet with a message authenticator 4.2.1 Channel Identification and an optional encryption wrapper to provide end-to-end SST’s negotiation protocol sets up a channel’s initial state security for all packets transmitted over the channel. 

The reasons for SST's limited adoption may include: 1) the protocol does not look like TCP on the wire, unless it is tunneled through another transport protocol (e.g., TCP or UDP), which takes away much of the advantage of the abstraction; 2) the focus of the proposal is towards a new transport layer, and it does not address issues of session semantics, thus not presenting enough value in terms of modern use cases to encourage users to migrate; 3) it does not support features such as multi-homing; and 4) it does not support communication between multiple participants.

2.2.2 TNG: Transport Next Generation

In this research [34, 35] the authors suggest defining a new conceptual framework for Internet transports. They motivate this by the following arguments: a) concurrent multipath transport requires that congestion-control procedures be separated from the streams accessible to the application; b) sharing congestion state for the same path across multiple contexts avoids an explosion of congestion-control contexts maintained by the stack; c) separating the end-point naming mechanisms from the application naming procedures allows intermediate entities such as firewalls and NATs to apply application-sensitive administrative policies; and d) end-to-end security mechanisms need to be revamped, since IP-level mechanisms such as IPsec interfere with application gateways and other transport-layer procedures.

For these reasons they propose classifying the transport-layer functionality into four independent layers, as summarized in Figure 2.8. The end-point related functions are defined as part of the end-point layer; the performance-related functions and congestion-control mechanisms are consolidated in the flow-regulation layer; the end-to-end security functions, which the authors present as optional, form the isolation layer; and the end-to-end semantics, such as reliability and ordering, define the semantic layer.
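To make this factoring concrete, the four layers can be modeled as independently composable objects, each adding exactly one concern on top of the layer below. The sketch below is a toy illustration only; the class names and header tags are hypothetical and are not taken from the Tng proposal.

```python
# Toy model of Tng's four-layer factoring. Each layer wraps the one
# below it and contributes a single concern; the tags stand in for
# real protocol headers. All names here are illustrative.

class EndpointLayer:
    """Endpoint naming (e.g., ports), visible to NATs and firewalls."""
    def send(self, payload: bytes) -> bytes:
        return b"EP|" + payload

class FlowLayer:
    """Performance and congestion control, shareable with in-path PEPs."""
    def __init__(self, lower):
        self.lower = lower
    def send(self, payload: bytes) -> bytes:
        return self.lower.send(b"FLOW|" + payload)

class IsolationLayer:
    """Optional end-to-end security, shielding the layers above."""
    def __init__(self, lower):
        self.lower = lower
    def send(self, payload: bytes) -> bytes:
        return self.lower.send(b"SEC|" + payload)

class SemanticLayer:
    """End-to-end semantics such as reliability and ordering."""
    def __init__(self, lower):
        self.lower = lower
    def send(self, payload: bytes) -> bytes:
        return self.lower.send(b"REL|" + payload)

# Compose bottom-up: endpoint, flow regulation, isolation, semantics.
stack = SemanticLayer(IsolationLayer(FlowLayer(EndpointLayer())))
wire = stack.send(b"hello")
```

Because the endpoint and flow headers sit outside the isolation layer's header, an in-path device can read and act on them without ever seeing the end-to-end content, which is precisely the separation the authors use to make performance-enhancing proxies non-destructive.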

This classification is intended to realize functional requirements such as multi-homing, backward compatibility for networking applications and protocols, the ability to differentiate between legacy transport protocols and TNG, resolving the issue of coupled interface and end-point naming, and assisting the development of end-to-end authentication and privacy solutions. Non-functional requirements such as load balancing over multiple paths, TCP friendliness, congestion-state sharing, and support for small transactions (such as an HTTP GET) are also claimed to be realized by this structure.

Figure 2.8: Separating the application and network-level semantics of the transport layer. Note: This image has been taken from the authors' research paper [35].

Although the authors made a significant contribution by presenting a perspective on the functional roles of the transport layer, the proposal has not been widely adopted. The reasons for this lack of adoption may include: 1) lack of backwards compatibility with applications as well as the network stack, since the protocol does not look like TCP on the wire unless a stub is used for interposition, which in itself inhibits deployment; 2) it does not cater to the need for session semantics and therefore does not enable a representation of the conversation between processes; instead, the primary focus is on delineating the functional roles of the transport layer in terms of modern use cases; and 3) it does not support communication between more than two processes.



2.2.3 Stream Control Transmission Protocol

The Stream Control Transmission Protocol (SCTP) [27], defined in 2000 and updated in 2007, is intended to be an alternative transport protocol. The protocol enables multi-homing, in that the application may choose to send data via one interface or another. An SCTP association is illustrated in Figure 2.9. Unlike TCP, the protocol proposes the use of message boundaries; instead of sending data as a stream, the application packages payloads as independent messages called chunks, which are sent via one of the available network interfaces.
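The message-boundary distinction is significant in practice: TCP delivers an undifferentiated byte stream, so an application that wants SCTP-style message semantics over TCP must implement its own framing. Below is a minimal length-prefix framing sketch; the helper names are illustrative, not part of any standard API.

```python
import struct

# Minimal length-prefixed framing over a byte stream -- the kind of
# message-boundary logic SCTP provides natively via chunks, but which
# applications must hand-roll on top of TCP.

def frame(message: bytes) -> bytes:
    """Prefix each message with its 4-byte big-endian length."""
    return struct.pack("!I", len(message)) + message

def deframe(buffer: bytes):
    """Split a received byte stream back into whole messages."""
    messages = []
    while len(buffer) >= 4:
        (length,) = struct.unpack("!I", buffer[:4])
        if len(buffer) < 4 + length:
            break  # partial message: wait for more bytes
        messages.append(buffer[4:4 + length])
        buffer = buffer[4 + length:]
    return messages, buffer  # leftover bytes stay buffered

# Two logical messages survive being coalesced into one TCP segment.
stream = frame(b"chunk-1") + frame(b"chunk-2")
msgs, leftover = deframe(stream)
```

SCTP's chunks remove the need for this boilerplate, which is one instance of the duplicated effort across applications that this dissertation highlights.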

Figure 2.9: An SCTP association. Note: This image has been taken from RFC 4960 [27].

Some of the language used to describe SCTP has been introduced in the In spite of supportingprevious sections. multi-homing, This section SCTP hasprovides not been a consolidated adopted widely list of perhaps the because: 1) key terms and their definitions. it uses headers that are different from TCP and therefore middleboxes do not let its packets o Active destination transport address: A transport address on a through 67 ;peer 2) itendpoint is not backwardsthat a transmitting compatible endpoint with legacyconsiders applications available orfor TCP IP stacks; [ ] receiving user messages. / and 3) it only supports the communication model where a pair of participants communicate and does not consider modern scenarios where multiple processes may be participating in Stewart Standards Track [Page 6] conversations. CHAPTER 2. RELATED WORK 30

2.2.4 Multipath TCP

Multipath TCP [28–30, 83–85], introduced in 2006 and proposed in 2011, suggests extensions to TCP to support multi-homing, fault tolerance, and flow migration. Since the proposed extensions are based around TCP, they are backwards compatible. To use the extended functionality, however, the applications must be modified; using the legacy API does not translate into access to all features. An illustration of the MPTCP stack is shown in Figure 2.10.

Figure 2.10: An illustration of the MPTCP network stack and its comparison with the standard TCP stack. Note: This image has been taken from RFC 6824 [85].

Although Multipath TCP does not deal with session semantics (i.e., it does not propose a session or another communication abstraction) and there is no support for communication involving more than two participants, it is beginning to gain traction in the networking community as well as in the industry [86]. Multipath TCP does support reconfiguration of flow setups for some use cases, for example MPTCP flow migration, but not general reconfiguration. The focus of the proposal is primarily on extending TCP and not on higher-level abstractions, which may enable modern use cases.
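This backwards compatibility is visible at the API level. On Linux (kernel 5.6 and later), MPTCP is requested through the ordinary socket call with a distinct protocol number, and applications that do not opt in keep getting plain TCP. The sketch below is hedged with a fallback so it also runs on stacks without MPTCP support:

```python
import socket

# IPPROTO_MPTCP is 262 on Linux (kernel >= 5.6). Older kernels and
# other platforms reject it, so we fall back to plain TCP -- mirroring
# MPTCP's backwards-compatible design goal.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)

def open_stream_socket():
    """Try MPTCP first; fall back to ordinary TCP if unsupported."""
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
        return s, "mptcp"
    except OSError:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM), "tcp"

sock, flavor = open_stream_socket()
sock.close()
```

Whichever branch is taken, the application code above the socket is identical, which is exactly why incremental deployment of MPTCP has been feasible.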

2.3 Network-Stack Extensions

In contrast with the transport-layer proposals presented in the previous section, network-stack extensions look towards adding functionality to layers in addition to the transport layer. In some cases the proposals add functionality as services for the transport layer (e.g., SERVAL [26]); in others they change how the existing naming mechanisms work (e.g., HIP [39]).

2.3.1 SERVAL: An End-host Stack for Service-centric Networking

SERVAL, an end-host stack for service-centric networking [26], proposed in 2012, put forward a service-access layer to: 1) enable end-points to use multiple network addresses, 2) migrate flows across interfaces, and 3) create multiple flows for communication. SERVAL suggests that the proposed layer sit between the transport and IP layers to reduce the coupling between the two and enable the features listed above. An illustration of this placement is shown in Figure 2.11.

Figure 2.11: An illustration of the SERVAL network stack highlighting the separation of service-level data and control. Note: This image has been taken from the authors' research paper [26].

The fundamental limitation in the adoption of this proposal is that the traffic does not look like TCP on the wire. This is because we know that middleboxes typically do not let traffic through that does not look like TCP or UDP [67]. Also, if there are no proxies in place that can transform traffic between SERVAL and legacy stacks, then stacks incorporating SERVAL won't be able to interact with legacy stacks. Finally, the research does not cater to session semantics that might assist with describing modern communications, for example, an abstraction that represents the conversation between processes or supports two or more participants in the communication; the focus of SERVAL is towards a service abstraction, not a communication-session abstraction.

2.3.2 Congestion Manager

The Congestion Manager (CM) [87], proposed in 2001, does not intend to propose a communication abstraction. The contribution of the research was to solve a different problem — a congestion-management framework that spans multiple transport connections. This serves as an extension to the network stack. CM's placement in the network stack is illustrated in Figure 2.12. CM is relevant to the discussion of a session abstraction for the following reasons. As we will discuss in the following chapters, a conversation between participants/processes may involve more than one stream. In such a scenario, it is pertinent to consider fairness in sharing network resources with other processes. In essence, CM attempts to address a piece of the puzzle and enables the development of a session abstraction.
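The core idea can be sketched as per-destination congestion state shared by every flow to that host, rather than one congestion window per connection. The toy model below is illustrative only; the class and method names are hypothetical and do not reflect the CM API.

```python
# Toy model of the Congestion Manager idea: all flows to the same
# destination share one congestion state, rather than each transport
# connection probing the path independently.

class SharedCongestionState:
    def __init__(self, cwnd=10):
        self.cwnd = cwnd   # shared window budget, in segments
        self.flows = []

    def register(self, flow_id):
        self.flows.append(flow_id)

    def allowance(self, flow_id):
        """Apportion the shared window equally among registered flows."""
        return self.cwnd // len(self.flows)

    def on_loss(self):
        # One multiplicative decrease applies to every flow at once.
        self.cwnd = max(1, self.cwnd // 2)

manager = SharedCongestionState(cwnd=12)
manager.register("http")
manager.register("rtp")
per_flow = manager.allowance("http")   # 6 segments each
manager.on_loss()
after_loss = manager.allowance("rtp")  # 3 segments each
```

A loss observed by any one flow halves the budget for all of them, which is how CM keeps an ensemble of streams to the same host collectively TCP-friendly.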

Figure 2.12: The network stack with the Congestion Manager. Note: This image has been taken from the authors' research paper [33].

2.3.3 Mobile IP (v4 and v6)

Mobile IP v4 [88] and v6 [89] are valuable IETF standards that are designed to solve the problem of allowing host mobility while maintaining a permanent IP address. In other words, the proposal solves the problem of location-independent routing. This is achieved by identifying the end host through its home network address, regardless of its current location in the network. While the proposal makes a valuable contribution towards location-independent routing, it does so by introducing requirements for added support in the infrastructure (in the form of both hardware and software). The proposal also introduces the challenge of triangle routing while enabling location-independent routing.

Although the proposal makes valuable contributions, as with other proposals mentioned earlier, we see that the scope of the contribution is focused on the specific problem of location-independent routing. The proposal does not address larger challenges such as those of describing modern communications with suitable abstractions. The solution essentially mitigates a limitation of TCP implementations by working around the problem, instead of addressing the root cause, i.e., the coupling of transport and network identifiers.

2.3.4 MSOCKS - An Architecture for Transport Layer Mobility

In this research [40], the authors develop the idea of mobility management with the aid of a proxy that implements the required semantics. As shown in Figure 2.13, the idea is to split the communication stack between the mobile node and the proxy to facilitate mobility. This splitting of the stack enables features such as:

a) Enable support for use of multiple interfaces;

b) Enable simultaneous use of multiple transport flows over different network interfaces;

c) Provide the ability to define preferences for each network interface;

d) Implement presentation primitives such as translation, encoding, compression, encryption, etc., at the proxy; and

e) Manage disconnections and migration of flows from one network interface to the other.
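The proxy's central mechanism is a TCP splice: a relay that copies bytes between the mobile-node leg and the server leg, so that the server-side connection survives the mobile node migrating to another interface. The sketch below is a simplified model, not the MSOCKS implementation, with local socket pairs standing in for the two network legs:

```python
import socket

# Simplified model of the proxy-side splice in a split-stack design
# such as MSOCKS: the proxy relays bytes between the mobile-node leg
# and the server leg, so the server's connection is unaffected when
# the mobile node moves to another interface.

def splice_once(leg_a: socket.socket, leg_b: socket.socket, bufsize=4096):
    """Copy one buffer of data from leg_a to leg_b."""
    data = leg_a.recv(bufsize)
    if data:
        leg_b.sendall(data)
    return data

# Simulate the two legs with socket pairs (mobile<->proxy, proxy<->server).
mobile, proxy_mn = socket.socketpair()
proxy_sh, server = socket.socketpair()

mobile.sendall(b"GET /")          # mobile node sends on its current interface
splice_once(proxy_mn, proxy_sh)   # proxy splices the bytes toward the server
received = server.recv(4096)

# On migration, the proxy would accept a new mobile-node connection and
# swap proxy_mn for it; proxy_sh (the server leg) is left untouched.
for s in (mobile, proxy_mn, proxy_sh, server):
    s.close()
```

Only the mobile-node leg is ever replaced, which is why the correspondent host never observes the disconnection.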

Figure 2.13: Network topology showing the location of a proxy between the mobile node and the server host. This implies use of multiple network interfaces. Note: This image has been taken from the authors' research paper [40].

The idea of MSOCKS is developed by designing a custom protocol between the proxy and the mobile node. This protocol allows the mobile node to communicate control information with the proxy to realize the features listed above.

As part of the performance evaluation, the authors acknowledge concerns of scalability (e.g., performance impact with an increasing number of transports). In addition, the authors discuss an important aspect of dealing with disconnections and migrations, i.e., how quickly the implementation reacts to a disconnection and migrates its flows to alternate network interfaces. The results show that the implementation does not incur significant overhead in terms of both achievable throughput and latency while providing the required features.

Although the authors address issues that are pertinent to contemporary use cases — e.g., use of multiple network interfaces — the thrust of this research is towards mobility management. The proposal presents a solution to the problem; however, the need for deploying split-stack proxies all over the network raises serious challenges for deployment. Also, only the applications that can engage the split-stack proxies will benefit from the solution, rendering the proposal incompatible with legacy applications. Moreover, the proposed extensions do not address challenges of session semantics.

2.3.5 Host Identity Protocol

Moskowitz et al. proposed the Host Identity Protocol (HIP) in 2006 [37, 38], which was updated in 2015 [39]. The primary focus of this research is to address an issue of mobile networking. The intention is to allow hosts to maintain shared IP-layer state, which removes the coupling between the locator and identifier roles of the network address. This allows the network application to continue communicating in spite of a change of network (i.e., IP) address. HIP is based on a Diffie-Hellman key exchange, using public-key identifiers from a new Host Identity namespace for peer authentication.
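The exchange at the heart of HIP's base exchange can be sketched as classic Diffie-Hellman: each host publishes g^x mod p, and both sides derive the same shared secret from the peer's public value. The parameters below are toy-sized and insecure, and the final hash is only a stand-in; real HIP uses standardized groups and a more elaborate key-derivation scheme.

```python
import hashlib
import secrets

# Toy sketch of the Diffie-Hellman exchange underlying HIP's base
# exchange. Parameters are deliberately tiny and NOT secure.
P = 4294967291          # a small prime (2**32 - 5); toy-sized only
G = 2

# Each host picks a private exponent and publishes G^x mod P.
a_priv = secrets.randbelow(P - 2) + 1
b_priv = secrets.randbelow(P - 2) + 1
a_pub = pow(G, a_priv, P)
b_pub = pow(G, b_priv, P)

# Both sides compute the same shared secret from the peer's public value.
a_secret = pow(b_pub, a_priv, P)
b_secret = pow(a_pub, b_priv, P)

# A session key can then be derived, e.g., by hashing the shared secret
# (HIP's actual keying-material derivation is more elaborate).
key = hashlib.sha256(str(a_secret).encode()).hexdigest()
```

Because the identities used in the real protocol are public keys rather than IP addresses, the derived state survives a change of network address, which is what decouples the locator from the identifier.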

While the proposal addresses an important problem for modern use cases, the solution tackles only one aspect of the larger problem, which is to develop and implement a framework for a communication model that describes modern communication. Nevertheless, HIP is relevant to the discussion of mobile networking.

Note that since HIP introduces IPv4 UDP pseudo headers (or IPv6 pseudo headers), legacy stacks that are not configured to expect them will not be able to parse the packets, and therefore communication will fail. As we have explained earlier and discuss later, this proposal addresses symptoms caused by the limiting assumptions of TCP/IP stack implementations. The approach here is to solve mobility issues rather than to address the root cause of the problem (instead working around it). This reduces the motivation to adopt this proposal, especially when it is not backwards compatible with legacy stacks.

2.3.6 Middlebox Communication (MIDCOM) Protocol Semantics

Network researchers have been studying the interaction of end-hosts with middleboxes [41–45]. Here, the authors argue that middleboxes should be application agnostic (i.e., they should not be required to maintain application intelligence to assist to the fullest). For this reason, they propose an architecture and a framework to allow trusted entities — referred to as MIDCOM agents — to assist middleboxes in meeting their objectives without incorporating application intelligence in the middleboxes. The MIDCOM agents may reside on end-hosts, proxies, or application gateways depending upon the circumstances. Essentially, the intention is to explicitly control middleboxes.

Their goal is to promote middleboxes to be first-class citizens in the realm of networking. With MIDCOM agents deployed at end-hosts, the middleboxes can assist in constructing communications without having to understand the application logic; this is essentially what software-defined networking [15] has been able to achieve in a different manner within the scope of a single autonomous system.

2.4 Clean-Slate Designs

The networking community seems divided over the approach for incorporating added functionality into the network stack. Some argue that support for modern use cases should be developed in a backwards-compatible manner, while others argue that doing so is not possible due to the baggage that comes with the limitations of existing designs and implementations. Those who argue in favor of the latter position propose clean-slate designs. We present two of these proposals below.

2.4.1 Networking is IPC: A guiding principle to a better Internet

Day et al. argue that instead of attempting to apply a "fix, patch, or a point solution" to the challenges of modern networking, there is a need to take a fresh, comprehensive, and general approach towards networking [51]. They suggest that networking is not a set of layers with different functions; instead, it is a single layer of "distributed inter-process communication", which repeats over different scopes. An illustration of the single layer of IPC is presented in Figure 2.14.


We all became familiar with the “layered” reference FigureFigure 2.14: A 1: single One layer of layer IPC that consists of IPCof host with consisting user applications and of IPC hosts subsystems. with Note: model of ISO OSI as well as the layered TCP/IP architec- userThis image applications has been taken from the and authors’ IPC research subsystems. paper [51]. ture. In these models, a layer is said to provide a “service” to the layer immediately above it. For example, the transport Note thatFigure applications 1 shows communicate the with elements the help of of the andistributed IPC IPC facility facility, which required imple- layer provides “virtual” end-to-end channels to the applica- forments communication the IPC mechanisms. This between includes the two protocols application that manage distributed processes IPC features in two tion layer, and the internetworking layer provides the trans- hostssuch as thatresource are information directly exchange, connected routing, error by and a flow physical control, multiplexing, link. The andap- se- port layer with “physical” packet delivery across individual plication protocol part of the application processes establish networks making up the Internet. communication using an IPC interface. This IPC interface What’s wrong with this layered model? As Robert Met- allows the source application process to name the destina- calfe’s quote in the paper’s subtitle indicates, we have always tion application process and specify desired properties for known that IPC was the core of the problem, but we some- the communication. Application names should be location how missed what it could tell us. Both the transport and independent, and unlike existing IPC interfaces (notably the internetworking tasks together constitute an IPC service to sockets interface), applications never see addresses. The job application processes. Let us call this an Internet-wide IPC of the IPC facility is to: service. 
Now, to implement such a service over individual ISP networks, we contend that one needs a similar ISP-wide locate the destination application process using its name, • IPC service over each ISP network. In other words, we need if found2, establish the communication channel and allo- to repeat such an IPC service over dierent regions/scopes. • cate resources required to meet the desired properties3, Of course, an ISP, in turn, may manage its own network return unique port IDs to the application processes to (perhaps large-scale and/or with a significant all-wireless • use to send/receive data over the allocated channel, and component) by implementing IPC layers of narrower scope to release the channel when done. over a number of its own components. We note that we are aware that “recursion” has been Remark: Unlike existing IPC interfaces, it is not necessary recently promoted in network architectures, but to the best to overload port IDs with application-name semantics. Here of our knowledge, this has been limited to tentative pro- a port ID is simply a local, dynamically assigned, identifier posals of repeated functions of existing layers, and how one that identifies one end of a channel/connection at the layer may either reduce duplication or create a “meta”-function boundary. (e.g., error and flow control) that could be re-used in many To accomplish its job, the IPC facility needs mecha- layers, e.g. Touch et al. (2006) [10]. Independently, we have nisms to support the following functions: pursued a general theory to identify patterns in network ar- chitecture [2] (1996). This proposal is based on this dierent an IPC manager to manage the various functions (dis- direction [3]: • cussed below) needed to establish and maintain connec- tions, Application processes communicate via a distributed a Resource Information Exchange Protocol (RIEP) to IPC facility. 
The processes that make up this facil- • populate a Resource Information Base (RIB) with ap- ity provide a protocol that implements an IPC mech- plication names, addresses, and performance capabili- anism, and a protocol for managing distributed IPC ties, used by various DIF coordination tasks, such as (routing, security and other management tasks). routing, connection management, etc., We need to view what repeats, as an IPC service, which an Error and Flow Control Protocol (EFCP) to support combines transport (flow-based quality-of-service), routing • requested channel properties during data transfer, (multiplexing/relaying), and other management functions. a multiplexing task to eciently use (schedule) the un- This enables each ISP (at any level, small or large) to sell its • derlying IPC facility (communication medium) that is IPC-based services to others, thus promoting competition shared among several connections. and an organized market-driven Internet. 2if the destination application is found but is not available, the IPC facility could start it. 3Resources could be allocated in many dierent ways, including best-eort, dierentiated, or guaranteed services.

2 CHAPTER 2. RELATED WORK 38 curity.

They note that with IPC they imply the general model of communication and do not refer to a particular implementation. Essentially, the larger the scope of the network, the more IPC layers would be needed to to be stacked. This design allows to build networks from smaller and manageable building blocks. The hope is that the proposal would not only support traditional networks but also contemporary use cases.

2.4.2 Internet Indirection Infrastructure

To meet the needs of modern communication, Stoica et al. urge that we reconsider the way we approach networking [90]. They suggest that the Internet was originally designed to provide unicast communication between fixed locations. However, with modern networking this is no longer the case. They address the challenges of mobile networking as a use case and propose a rendezvous-based communication scheme, the Internet Indirection Infrastructure (i3), that decouples the act of sending from the act of receiving.

Figure 2.15: Internet Indirection Infrastructure (i3): An illustration of communication between two nodes where the receiver inserts a trigger and the sender sends corresponding data. Note: This image has been taken from the authors’ research paper [90].

An illustration of the rendezvous-based communication is presented in Figure 2.15. The receiver indicates interest in data by registering triggers in the system, while the sender generates

corresponding data. Such a model decouples the notion of communication labels from the lo-

cation of the end hosts and in doing so mitigates concerns of mobility. Such a model also

describes different communication paradigms (e.g., multicast, broadcast, anycast).

2.5 Discussion

In this chapter we presented different research contributions ranging from session-layer pro-

posals, transport-layer extensions, network-stack modifications, to clean-slate designs. All

these proposals highlighted the fact that there is a dire need for greater functionality and sup-

port from the network stack to enable modern use cases [3, 6]. However, we see that except

for Multipath TCP [29], none of the proposals have gained traction with the wider community, even though the proposed solutions do solve the problems they highlight.

We note that all the proposed solutions, which have not been adopted by the wider commu-

nity, show a common pattern, in that they fall short in terms of one or more of the following

concerns. We summarize the discussion below in Table 2.1.

Table 2.1: Comparison of select state of the art vs. our contributions (✓: supports, ✗: does not support, ✓/✗: can support; subjective). The proposals compared are SIP [81], TESLA [12], SCTP [27], SST [18], MPTCP [29], SERVAL [26], i3 [90], and our contributions [69–76]. The comparison criteria — session abstractions, flow migration, endpoint migration, extensibility, like TCP on the wire, compatibility with legacy applications, compatibility with TCP stacks, transport independence, fault tolerance, dynamic reconfiguration, and multihoming — span the themes of separation of semantics and enabling innovation, backward compatibility, and enabling greater functionality.

• Limited or Missing Communication Model/Abstractions: Although the proposals intend to add support for greater functionality, which applications can use to realize modern use cases, they do so in a manner that does not include a model or set of abstractions that describe modern communication in general. In cases where the proposals do include abstractions, we see that the scope of these abstractions is limited.

For example, with HIP [39], the focus is on engineering a solution to the challenges of naming end points; no communication abstractions are proposed that may be used

to describe communication in general. Similarly, with structured streams [18], while the authors present excellent insights and a useful transport abstraction, this abstraction

alone is not sufficient to describe the variety of use cases that we see in modern commu-

nication. In other cases where a session layer is proposed to describe communication,

it only presents a limited set of abstractions that address specific concerns. For example,

with TESLA [12], the authors present flow abstractions alone, which indicates that the scope of the model is limited to transport semantics.

We acknowledge that aspiring to a silver bullet that describes all modern use cases is futile. Nevertheless, a communication model (or set of abstractions) should be able to describe a reasonable set of use cases for it to appeal to the wider community.

• Limited Scope and Point Solutions: In most cases, the proposed solutions are aimed at solving a particular problem in a specific scenario. In the context of adding support for modern use cases, the appeal of adopting point-specific solutions is minimal for the wider audience.

For example, with Phoebus [23], where the proposal is to implement a session layer that enables longitudinally hybrid transport constructs (i.e., packet- and circuit-switched

transports), the appeal for the common user will be minimal. Unlike scientists who have

to move large volumes of data across large geographical distances, the common user

would not be interested in putting in the effort for something that might not be of use.

Although Multipath TCP has been implemented as part of Apple’s iOS [91] and therefore has seen significant adoption, it too addresses problems that are specific to transport

semantics alone. Even if we are to consider a Multipath TCP flow as a session, as with

legacy TCP, the entire session is confined within the scope of a single flow. This precludes

the possibility of using this abstraction to describe communication between more than

two participants or considering more than one flow to be part of the session.

• Addressing Symptoms, Not the Root Cause: In most cases where extensions to the transport layer or the network stack are proposed, the solutions tend to address the symptoms of the actual problem instead of resolving its cause.

For example, with SERVAL [26], the proposal is to introduce flow and service identifiers, which, among other things, will address the naming problem due to mobility. As we

understand, the naming problem in case of mobility is caused due to the coupling of

the network address and the transport label — i.e., the use of IP address as both the

network address and part of the transport label. The solution, which addresses the root

cause instead of treating the symptoms, would be to not depend on the IP address as

the transport label, thereby avoiding any disruption in communication when there is a

change of IP address due to mobility.

• Backwards Compatibility: Most of the proposals do not take into consideration the legacy behind TCP. Today, nearly all Internet communication is over TCP; we noted in

August 2014, that the volume of traffic using TCP over the wireless network at Virginia

Tech was 2 TB per day. TCP is ubiquitous when it comes to network communication. It

is inconceivable to consider that any proposal for extensions to the network stack will

succeed without being backwards compatible with legacy applications or the network

stack (using TCP). It is not practical to have a flag day where network communication

worldwide (end-user applications, services, core infrastructure, etc.) would migrate from

TCP to the proposed solutions; if this were possible, migration from IPv4 to IPv6 would

not have been as slow and challenging as it has been.

• Transition Cost vs. Value: It appears that the foremost reason for the limited adoption of proposed solutions is the resistance to move away from TCP due to its widespread deployment; TCP's success, it seems, is a huge hurdle in the evolution of network communication [7, 67]. In other cases, the significant porting effort and limited scope of the solution prove to be detrimental, in that the value offered is far less than the transition

cost [3]. Unless the value offered by the proposed solution is such that it outweighs the transition cost, it is unlikely that the proposal would see wide-spread adoption. On

the other hand, as we’ve seen in case of Multipath TCP’s adoption [91] and with Open-

Flow [15], if the big players (such as Google, Facebook, Apple, Cisco) push for a change,

then perhaps the proposal might see less challenges in wider adoption [66].

In light of the discussion above, we recognize the following guiding principles when proposing extensions and/or communication models for modern use cases. For a proposal to be adopted by the wider community, it needs to be backwards compatible with legacy applications as well as legacy stacks. The proposed extensions must deliver enough value to encourage the transition from legacy TCP to the proposed extension. Finally, the proposed extension must enable future innovation, so that it does not become the new status quo.

Chapter 3

Session-Based Communication Model

Enabling Context-Awareness

To implement modern use cases, developers require support from the underlying stack. If such support is not available, extending existing implementations would suffice. However, legacy stacks pose challenges on both accounts.

While legacy stacks have met the communication needs for the proverbial 80% of the use cases, they do not enable the remaining 20% of scenarios that support modern use cases. Therefore, to meet the demands of innovation in modern communication, developers implement networking libraries, which work around the limitations of legacy stacks, thereby enabling contemporary use cases; Android OS [4] and Apple Continuity [5] are examples of such implementations among many others. This leads to considerable duplication of effort.

On the other hand, innovation itself is challenging due to the limiting assumptions of existing abstractions and their implementations. For example, the assumption that the network address does not change during communication, or that the duration of a connection is typically shorter than the lifetime of a network label, has led to implementations where connection labels are defined in part using network labels — i.e., services using the sockets API have the network address as part of the transport label [92]. This inhibits mobility. Similarly, limited support for extensions in legacy stacks (e.g., limited space for new TCP options, which can't get through middleboxes [7, 67]) has led to radical proposals such as QUIC [77].
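This address coupling is directly observable through the sockets API; the following minimal sketch (a loopback TCP connection in Python, used purely for illustration) exposes the 5-tuple that names the connection:

```python
import socket

# Minimal illustration of how the sockets API couples a connection's
# identity to network addresses: each end is named by an (IP, port)
# pair fixed when the connection is established, so a change of
# network address invalidates the transport label.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))                  # ephemeral port on loopback
srv.listen(1)
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# The transport label is the 5-tuple; the network addresses are part of it.
five_tuple = (*cli.getsockname(), *cli.getpeername(), "TCP")

cli.close(); conn.close(); srv.close()
```

If either host's address were to change (mobility, migration), the 5-tuple — and with it the connection — would no longer be valid.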

In this chapter we present our proposal of an extensible session-layer intermediary (SLIM), which provides session semantics to support innovative communications, enables future ex- tensions to the network stack, and does so in a manner that reduces duplication of effort. This layer:

• Uses session, flow, and end-point abstractions to support current and future communication patterns;

• Enables support for communication between two or more participants, along with features supporting mobility and resilient transport; and

• Provides an out-of-band signaling channel for the exchange of control messages, enabling the dynamic reconfiguration of ongoing communications and rejuvenating extensibility by creating a new middlebox-compatible option space.

Consider an example of a finance application, with a user interface on a smart phone and the service deployed in the cloud that interacts with stock exchanges to pull data and make trade decisions. Since the software service-level agreement requires minimal latency, it is imperative that the service moves to a data center near the stock exchange. For international trade applications, the service would need to move across the globe without disrupting communication. SLIM can not only mitigate the issues that inhibit mobility, but also facilitate the design of the service, where the session, flow and end-point abstractions can describe communication between the building blocks (e.g., data source, trade and decision model, transaction manager

and user interface). With out-of-band signaling available, extensions such as dynamic recon-

figuration of the stack, service discovery or seamless migration of processes between hosts can

be assisted by the session layer. Similarly, with multi-party semantics, a single session can be

used by multiple participants to manage a conversation. With the ability to group flows in a

session, common configuration parameters can be managed for all flows rather than defining

individual configuration of constituent flows. Therefore, SLIM has the potential to facilitate

modern use cases as well as pave the way for further extensions to the network stack.

With regards to SLIM, our contributions are:

• A discussion of the conflation of session management with transport semantics in the existing socket abstraction, suggesting the need for separate session, flow, and endpoint abstractions;

• The design of an extensible session-layer intermediary, SLIM, which supports typical and advanced communication models using the above abstractions; and

• A proof-of-concept implementation of SLIM that exposes an API for the aforementioned abstractions, explaining how backwards-compatible extensions enable incremental adoption.

Later, in Section 3.7, we discuss a case study where we establish how context awareness benefits

the setup and reconfiguration of complex communication constructs.

3.1 Conflation of Session and Transport Semantics in TCP

In order to instantiate a connection, TCP executes a 3-way handshake, which in turn establishes the communication session. However, this implicitly ties the semantics of the session to the semantics of the TCP transport connection. Not only does the 3-way handshake bootstrap the connection, but it also bootstraps the session, and the life cycle of the session is coupled to the lifetime of the TCP connection. Given the needs at the time of implementation, this was an effective and efficient solution. It was a significant contribution to the overall success of the

Internet as we enjoy it today. At the time, much network communication fit the model of two nodes exchanging information. Where communication was more involved, it was easy to map to a set of peer-to-peer connections, and the overhead of executing session bootstrap multiple times was not significant in the overall structure.

Today, this model is strained. Without explicit session management, developing modern use cases becomes a challenge as the legacy stack does not support the functionality necessary to realize them. For example, support for mobility is necessary to enable process migration. Legacy abstractions assume that conversations begin with the instantiation of transport connections and end with their termination. If processes were to migrate between hosts, the communication session would continue; however, the existing transport connections may become invalid and new connections would be required. To spare application developers from implementing fault tolerance and the necessary bookkeeping, it is important to distinguish between session and transport semantics and efficiently support their respective roles in the appropriate layers.

With explicit session management, we open multiple avenues for innovation and extensibility. In the case of applications establishing multiple streams between hosts, an existing transport connection can be leveraged to instantiate new connections — the 3-way handshake becomes redundant for subsequent connections because the session has already incurred the cost of bootstrapping. Doing so also allows data to be sent along with the first TCP segment. Similarly, congestion or flow control window sizes can be derived from the existing connection without violating fairness constraints, thus avoiding the slow-start phase.

Some communication libraries also conflate session semantics with their functionality. For

example, TLS also establishes a secure channel per stream. Instead of securing each stream,

it seems prudent to record such details as an attribute of the session. TLS does implement an

optimization where an authenticated session key can be reused. However, this approach is not

used widely as servers need to maintain a speculative session cache per client, which is costly

for high-traffic services.

With SLIM, based on the session, flow and end-point abstractions, we are able to separate the

semantics of session management — setting up, reconfiguring and tearing down communi-

cation sessions — from that of transport — which focuses on efficient data delivery. This is

because one is independent of the other. Doing so allows us to enable mobility, dynamic stack

reconfiguration, and communication between two or more participants.

3.2 Session-Layer Abstractions

Since the conflation of session and transport semantics in TCP leads to challenges in supporting

the desired communication patterns, enabling increased flexibility and extensibility requires

separating and improving the abstractions. In this section, we describe the end-point, flow

and session abstractions that form the session layer and discuss their interactions. (Figure 3.1

illustrates a representative session involving three end points with two data flows and one

control flow.)

Endpoint: An endpoint is a process participating in a conversation and represents a source and

destination of communications. Note that the real endpoint is often the user with some process

serving as proxy; in computer-to-computer communications, both endpoints are processes.

Our definition of an endpoint is in contrast with the definition of an endpoint in the socket

API. The socket API defines the endpoint as being an immutable entity — i.e., an identifier that is associated with the 5-tuple ⟨local IP, local port, remote IP, remote port, protocol⟩ at the time of

instantiation and is presented as a file descriptor [8]. The contrast is illustrated in Figure 3.2.

Figure 3.1: The session abstraction involving three participants, each with two data flows instantiated by the application. The control flow is an out-of-band channel that allows setup and reconfiguration.

Figure 3.2: Contrast of endpoint and socket abstractions.

Modern use cases require that the association of the process and the host that houses the endpoint not be permanent. Instead, an endpoint should be identified by a label independent of the network address of its host. A process may change hosts (in case of service migration), may change network attachment points (in case of mobility), or may want to use multiple network paths (and therefore potentially use multiple network interfaces). Therefore, to construct an endpoint label independent of the underlying layers, we draw inspiration from the Host Identity Protocol (HIP) [39] and use public keys as the foundation for building identifiers. Since private-public key pairs are uniquely associated with users or services [93], using the public key as the foundation of a unique label is prudent for modern communications.

As shown in Figure 3.3, we construct a unique endpoint label using a cryptographic hash [39] (or a fingerprint) of the user's, service's, or client's public key along with a suitable tag to distinguish between endpoints. Based on the mathematics of the birthday problem, we can conclude that with a hash size of more than 100 bits, we can safely assume that a collision will not occur until about one quadrillion (i.e., 2^50) hashes are generated. For this reason, we chose 128 bits as the hash size, 24 bits as the distinguishing tag size, and 8 bits for the hash-algorithm type, resulting in a label size of 160 bits (20 bytes) and allowing more than 16 million endpoints per key and 256 hash algorithms to choose from. Implementations may choose different label sizes based on the hash-algorithm type and the number of endpoints allowed per key. Note that Figure 3.3 is the logical representation of the label and does not cater to implementation concerns. For convenience, we also use human-readable tags mapped to endpoint labels

(e.g., meetup.alice).

Figure 3.3: Endpoint label: hash-algorithm type (i bits), cryptographic hash of the public key (j bits), and distinguishing tag (k bits).
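As a concrete sketch of this layout, the label can be assembled by concatenating the three fields; the snippet below (Python; the algorithm-type code and key bytes are hypothetical stand-ins) uses the sizes chosen above:

```python
import hashlib

SHA256_TYPE = 0x01  # hypothetical code for the 8-bit hash-algorithm type field

def endpoint_label(public_key: bytes, tag: int, alg_type: int = SHA256_TYPE) -> bytes:
    """Assemble a 160-bit endpoint label: 8-bit type | 128-bit hash | 24-bit tag."""
    assert 0 <= tag < 2**24, "tag must fit in 24 bits"
    fingerprint = hashlib.sha256(public_key).digest()[:16]  # truncate to 128 bits
    return bytes([alg_type]) + fingerprint + tag.to_bytes(3, "big")

# One key plus distinct tags yields distinct endpoint labels.
label = endpoint_label(b"alice-public-key-bytes", tag=1)
assert len(label) == 20  # 160 bits
```

A human-readable tag such as meetup.alice would then be an external mapping onto such a 20-byte label.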

The endpoint labels may be translated to obtain information that describes how the endpoint may be contacted. In the case of TCP, this contact information would be an IP address and port number. We can imagine adapting the domain name system (DNS) [94] to serve as a translation service.

Deriving the endpoint labels from the public key enables intrinsic security in the design of communications. For example, access control services (of identification, authentication, and authorization) may be enabled by verifying digitally-signed endpoint labels, which we will discuss further in § 3.3.2. In the same vein, symmetric ciphers may be derived using associated private-and-public keys to enable confidentiality. Note that the use of PKI does not require deployment of additional infrastructure beyond what already exists.

Flow: A flow represents a data exchange between a set of endpoints. It gives a name to the concept of communication but requires mapping onto underlying transport connections before communication actually occurs. Because flows are independent of transport connections, the concept of a flow can precede the creation of a transport connection and can persist after the transport connection has been closed. This separation allows us to distinguish between session and transport semantics. Doing so further enables reconfiguration of flows on the fly (and subsequently their association with underlying transports). This is illustrated in Figure 3.4, where a flow may be mapped onto a transport connection p and later mapped onto a transport connection r (e.g., when transport p is disrupted due to reconfiguration or migration of the client into a different subnet). Reconfiguration of transport connections is not possible when using the socket abstraction, since it essentially means terminating and setting up a new instance with the intended parameters.

We identify each flow with a human-readable label that is mapped to an opaque identifier, unique within the scope of the session. Initially, the endpoints have different opaque labels for the same flow. However, endpoints may exchange and agree upon flow labels for the duration of the conversation. We explain this further in § 3.3.2.

SLIM supports the notion of data and control flows, both shown in Figures 3.1 and 3.4. SLIM data flows are visible to the application; SLIM control flows are not. SLIM exposes data flows to the application as a means for exchanging data between participants. On the other hand,

SLIM uses control flows to configure or reconfigure session-layer abstractions. For example, the control flow may be used to seamlessly create and destroy transport connections (as needed) to support mobility, resilience, and dynamic reconfiguration. It also uses control flows as an extension mechanism. We discuss control flows further in § 3.3.2.

Figure 3.4: The flow abstractions and their mappings onto underlying transports in relation to time.
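The remapping in Figure 3.4 can be sketched as a flow whose transport binding is mutable state; a minimal illustration (Python; the class, method names, and placeholder transport strings are hypothetical):

```python
# A flow names the communication; its binding to a transport is mutable,
# unlike a socket, which must be torn down and re-created to change
# transports. Placeholder strings stand in for real transport connections.
class Flow:
    def __init__(self, label: str):
        self.label = label
        self.transport = None       # a flow may exist before any transport

    def attach(self, transport) -> None:
        self.transport = transport  # (re)map onto an underlying transport

    def detach(self) -> None:
        self.transport = None       # the flow persists after the transport closes

flow = Flow("data flow 1")
flow.attach("transport connection p")  # initial mapping
flow.detach()                          # e.g., transport p disrupted by migration
flow.attach("transport connection r")  # remapped; the flow's identity is unchanged
```

The point of the sketch is that the flow's label outlives any particular transport binding, which is exactly what the socket abstraction cannot express.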

The manner in which a flow is mapped onto a transport is guided by its structure. There are at least two types of structures for flows: 1) broadcast and 2) one-to-one. With a broadcast structure, the reads and writes to the flow involve all participants. On the other hand, a one-to-one structure suggests a point-to-point link between the participants. This is illustrated in

Figure 3.5. Other forms or structures are worthwhile; however, we focus on these two common structures in this research.

Figure 3.5: The structure of flows in relation to the endpoints: a broadcast flow among endpoints a, b, and c (left) versus pairwise 1-1 flows (right).
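One way to reason about the two structures is by the transport connections needed to realize them; the helper below is a hypothetical sketch (Python) in which a broadcast flow is realized as a full pairwise mesh — one of several possible realizations:

```python
from itertools import combinations

def required_connections(endpoints, structure):
    """Transport connections needed to realize a flow among the endpoints."""
    if structure == "one-to-one":
        # A point-to-point link between exactly two participants.
        assert len(endpoints) == 2
        return [tuple(endpoints)]
    if structure == "broadcast":
        # Reads and writes involve all participants; sketched here as a
        # full mesh of pairwise links (multicast would be an alternative).
        return list(combinations(endpoints, 2))
    raise ValueError(f"unknown structure: {structure}")

mesh = required_connections(["a", "b", "c"], "broadcast")  # 3 pairwise links
link = required_connections(["a", "b"], "one-to-one")      # 1 link
```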

Session: A session represents the complete conversation between participants in an agreed-upon context. It encapsulates endpoints and flows that constitute the conversation and allows them to be reasoned about together. This is illustrated in Figure 3.1.

Each session is labeled with a session identifier, chosen by the endpoint initiating the session.

As illustrated in Figure 3.6, the label consists of the initiator's endpoint label and a distinguishing tag. While the session label includes the initiator's endpoint label, the session label is meant to be taken as an opaque identifier. We also map a human-readable label to the session label

— e.g., org.meetup. The session label must be globally unique if it is to serve as a publicly accessible session. It is this globally unique label that is to be used by endpoints interested in joining the conversation. Note that the globally-unique characteristic is achieved since the session label is based on the initiator's endpoint label, which itself is based on a globally unique public key. We chose the tag size to be 32 bits, which allows for about four billion simultaneous sessions per endpoint. As with the endpoint labels, different session tag sizes may be implemented. Our choice of 32 bits is intended to accommodate a sufficient number of sessions per endpoint and thus future-proof the labels. Session labels may be published and publicly visible or may remain unpublished. We discuss aspects of session-label registration further in § 3.3.1.

Each endpoint maintains a local view of the session. The identities of the endpoints are maintained as part of the session state. In addition, identities and descriptions of flows originating from or terminating at the endpoint are also recorded as part of the session state. A local session state available at an endpoint would not include information about flows that the endpoint is not involved with (e.g., endpoint a would not be aware of the 1-1 flow between endpoints b and c in Figure 3.5).

Figure 3.6: Session label, consisting of the initiator’s endpoint label (i bits) followed by a session tag (m bits).

3.3 SLIM’s Architecture

As Figure 3.7 illustrates, SLIM exposes an API while providing three sets of services to the application to assist with communication setup and management. These services fulfill three roles: 1) session management, 2) negotiation of configuration, and 3) data transfer. SLIM uses the underlying transport services to realize the session abstractions.

SLIM does not force the application to gain network access through it, as shown in Figure 3.7. The reasons for this are two-fold. First, it highlights that SLIM is designed for incremental adoption and therefore does not force applications to use SLIM to access network services; applications may continue to use the legacy socket API if they choose to do so. Second, it emphasizes that SLIM is primarily engaged in communication setup (in the beginning) and management (in case of reconfiguration) and does not interfere with data exchange. As we explain later, SLIM is a “pass through” for data exchange once communications are set up.

Figure 3.7: SLIM in relation to the network stack. The Session Layer Transport Intermediary (SLIM) sits between the application and the transport layer — above the Transport, Internet Protocol, and Link layers — and provides session management, negotiation of configuration, and data transfer services.

3.3.1 Session Management

SLIM session services allow sessions to be instantiated, configured, reconfigured, and torn down. Figure 3.8 provides a state-transition diagram illustrating relationships between session primitives and states of the session abstraction.

SLIM’s state-transition diagram may appear to share similarities with the TCP state-transition diagram [1]. This is not because SLIM’s design is a derivative of TCP’s implementation; rather, as explained in § 4.3, legacy TCP implementations conflate session and transport semantics, hence its state-transition diagram has aspects relevant to session management. It is these shared aspects of session management that are reflected in both the state-transition diagrams.

Session-Related Primitives: Typically, an endpoint expresses its willingness to communicate and is then joined by other endpoints. An endpoint creates a session to encompass the intended communication and awaits contact from other endpoints. Later, other endpoints join the session and one or more data flows are added to begin communication. Alternatively, endpoints may be invited to participate in a session. Endpoints can leave the session at any time without disrupting the ongoing communications.

Joining a session requires knowledge of the session label. The session can be publicly advertised by registering the label with a session discovery mechanism. An endpoint wishing to join the session looks up the label from the session registry and requests a translation of the label into contact information. This contact information consists of the details necessary to reach the endpoint. An implementation of SLIM over TCP considers the network and port addresses as contact information and allows these to change during the lifetime of the session. Alternatively, the contact information for a session can be conveyed to an endpoint by other means (e.g., if the session is not registered publicly).

Although highly unlikely, there is a possibility that an endpoint joining a session may face a collision of endpoint labels. In this case the collision is resolved by choosing an alternate distinguishing tag (see Figure 3.3). Further naming issues, e.g., multiple identities per endpoint, are beyond the scope of this research and will be tackled as part of future work.

Figure 3.8: Session state-transition diagram. The states include INITIALIZED, REGISTERED LABEL, AWAITING CONTACT, ESTABLISHED, REVOKED LABEL, DEPARTED, and TERMINATED, with transitions driven by the primitives create(), register(), await_call(), join(), invite(), attend(), reconfigure(), leave(), revoke(), end(), and cleanup().

The session state recorded at the registry needs to be managed. The record is expunged when an endpoint ends the session, or it is maintained via a heartbeat mechanism that indicates the session’s continued existence. Other primitives that assist with negotiation of configuration are discussed in § 3.3.2.

Flow-Related Primitives: The add_flow primitive creates a data flow within the session. It takes a named parameter session and optional inputs of [structure] and [type]. The structure defines the communication model for the conversation (e.g., broadcast or one-to-one). The type defines the mapping onto the underlying transport (e.g., stream- or message-oriented transport). There may be more than one flow between processes. Although flows originating from the same endpoint are independent of each other, they are associated with each other through the endpoint. This is relevant because flow reconfiguration primitives may be applied to the session as a whole instead of individual flows — e.g., to change the underlying transport protocol from TCP to Multipath TCP. The terminate_flow primitive tears down the flow.

Session Registration and Discovery: The session registration and discovery mechanisms assist with recording and translating session labels to the information needed to participate in the session. Upon registration, a mapping is created between the session label and at least one endpoint participating in the session. The mapping is removed when an authorized participant revokes the label. In addition, for sessions listed with the registry, an update to the list of participants is made when an endpoint joins or leaves the session. This association of the joining or departing endpoint is initiated by the relevant endpoint itself.

We refer to the process that registers the label as the initiator. Any process that wishes to participate in a session can query the registry to acquire contact information about the participants and subsequently join the session. The only requirement is that the intended participant should know the label that it wants translated. The human-readable labels associated with session labels make this requirement relatively convenient to manage.

Mechanisms exist that address similar challenges of label registration and translation, which can be adapted to serve as a session registry. For example, we envision a DNS [95]-like hierarchical LDAP [96] service that would not only meet timeliness requirements but would also scale well. The use of hierarchical human-readable labels is expected to enable scalable solutions. Note that participants that do not wish to register their session can still communicate; however, this assumes that contact details of at least one participant are available. An in-depth investigation of session registries is beyond the scope of this research.

3.3.2 Negotiation of Configuration

In support of more sophisticated communications, SLIM uses the control flow to exchange commands (or control signals) between participating network stacks and peers. These commands, called verbs, include, for example, requests to suspend or resume a flow or to change the underlying transport protocol.

Verbs: Control flows exchange verbs to configure communications. The facility to exchange control information or alert peers of a change in the communication context, along with the ability to reconfigure abstractions after communication has been set up, enables dynamic configuration.

For example, Listing 3.1 shows an example JSON representation of the sync verb, requesting that the peer stacks update their endpoint-to-network-address mappings (e.g., when the source moves between subnets). This may result in a new transport connection if the existing transport is no longer valid. Applications will not be aware of this adaptive behavior, since SLIM insulates them from the underlying transport via the flow abstraction. The transition labeled reconfigure in Figure 3.8 reflects such reconfiguration.

Required fields of the verb indicate the source of the request, the session label, a transaction ID, and an authentication token. These enable stacks to distinguish between requests and to ensure authorization. The remaining fields are verb-specific.

    {
      "VERB": "sync",
      "SOURCE": "meetup.alice",
      "TRANSACTION_ID": "1872",
      "TIMEOUT": 20,
      "SESSION_LABEL": "org.meetup",
      "AUTH_TOKEN": "2N8ISiGELBzNw1sOunAxgOF3MrQF4ugf",
      "PAYLOAD": {
        "list": [{
          "END_POINT_LABEL": "meetup.alice",
          "IP": "192.0.1.222",
          "PORT": 5432
        }]
      }
    }

Listing 3.1: An example of a verb and its payload, represented in JSON, requesting an update of endpoint label mappings.

Activities (e.g., endpoints joining a session, reconfiguration requests through verbs) may be authenticated using the authentication tokens. These tokens represent digitally-signed endpoint labels, which can be verified because the endpoint label is based on the public key of the user, client, or service that the endpoint represents [93]. The authentication process also creates an opportunity for the recipient of verbs to pose compute-intensive puzzles before taking action and thus preempt denial-of-service attacks.

Much as network stacks disregard unknown TCP options and thereby facilitate incremental adoption of new extensions [1], the SLIM control protocol requires stacks to disregard unknown verbs. Currently, the set of verbs is limited to those needed to support migration and resilience (e.g., sync, suspend_flow, and resume_flow are useful for migration and recovery from disruption). Other verbs, such as change_protocol, allow on-the-fly reconfiguration of the underlying transport protocol.

Enabling Future Extensions: It is well established that sufficient TCP option space, particularly as part of the TCP SYN message, is no longer available for extensions that require space for control signaling [7]. To further add to the complexity, we know that middleboxes either do not allow TCP packets with custom options to traverse through them or strip custom options [29,67,71,73]. The control flow provided by SLIM serves as a signaling channel for future extensions. For example, to implement support for process migration or state synchronization between peers, verbs may be defined and implemented over the control channel. New primitives may then be implemented and exposed to support future network services. Examples of such extensions include change_protocol and sync. In § 4.5.3 we explain how verbs are implemented with corresponding handlers. To add extensions, developers implement handlers that receive, consume, and act upon the verb and its parameters.

Context Management: In addition to allowing the application to indirectly use the control flow as a means for exchanging control signals, a context manager also makes use of the control flow to exchange control signals between the network stacks. For example, when a network interface is disconnected, the transport connection using the interface would time out. To avoid having communications fail, the context manager recognizes the change and triggers an instantiation of a new transport connection that uses an alternate interface, if one is available. This would also trigger an exchange of a sync verb to synchronize state.

Here we have considered one aspect of context management (i.e., resilient communications). However, there are other possibilities that can be explored to enable the network stack to be cognizant of its operating environment. An example of such context awareness may be to recognize that multiple network paths to peers exist through different network interfaces on the host, thus enabling multi-homing, fail-over, or redundant communications. Similarly, the network stack may recognize the existence of an accelerator or a gateway in the network and enable configuration of communications — e.g., engaging SSL accelerators or interacting with captive portals. With session-based abstractions, SLIM serves as a suitable vantage point to realize a variety of goals, including policy enforcement and dynamic reconfiguration.

3.3.3 Data Services

Once communications are set up, the flow abstraction merely serves as an indirection to the transport connection. Thus, SLIM essentially acts as a pass-through for an application writing data to or reading data from a flow, until there is a disruption (e.g., connection loss due to migration) or a verb is triggered to reconfigure communications — e.g., SLIM participates to restore communications. It is significant to note that this pass-through nature of data services does not pose a measurable impact on the performance of communications.

3.3.4 Session vs. Transport Semantics

From the discussion above, we see how the semantics of managing the session can be envisioned without conflating them with the transport semantics. This is in contrast to the socket abstraction, discussed in Section 3.1, which conflates the roles of both into one. We see that all the session-related concerns can be dealt with independent of the underlying transport.

Whether the concerns are about creating a session, configuring it to include participants and communication flows, reconfiguring the session and/or flows to accommodate a change of operating environment or preferences, or gracefully tearing the session down, all are independent of the underlying transport semantics. In the case of TCP, it may continue to fulfill its responsibility of transport semantics, that is, in-order and reliable delivery as well as congestion and flow control to enable network adaptation.

Separating these responsibilities allows us to realize optimizations that we discussed earlier. For example, once a connection has been established to the host, we could use parameters associated with it to bootstrap any new connection to the same host, without going through the three-way handshake. This may allow us to send data along with the first TCP segment, avoid the slow-start phase, and benefit from knowledge of network performance parameters associated with the existing connection.

With multi-party communication, a session may include more than one flow between participants; therefore, the question of congestion control per session is not as simple as in the case of a single, independent transport connection. Instead, here we may want to consider all the flows forming the session as a whole. Although for our current prototype we do not implement such a mechanism — we depend on congestion control independently maintained by each flow — it is possible to implement a holistic mechanism such as Congestion Manager [33] or that used by Multipath TCP [29].

3.4 Communication Patterns

Below we discuss how several well-known communication patterns [97] benefit from a session abstraction, in general, and more specifically from SLIM. An illustration of these patterns is presented in Figure 3.9.

Figure 3.9: An illustration of communication patterns: client-server, peer-to-peer, pipeline, publish and subscribe, survey, and broadcast.

3.4.1 Client Server

All communication patterns benefit from the separation of session and transport semantics; once a session has been set up, flows of all varieties can be repeatedly created and terminated without incurring additional session-setup costs. As an example, a browser accessing a web page typically creates multiple connections to a server to obtain content for constructing the page. In this client-server pattern, once a session has been established, the transport connection-setup cost for subsequent connections is avoided because transport parameters between the same two hosts can be shared rather than recreated.

SLIM also assists with the design and understanding of a client-server system in which a process acts as a proxy for another group of processes. Consider a user accessing a social-media service; the client process acts as a proxy for the human with a single data flow to the server process.

On the other hand, the server process acts as a proxy for a multi-tiered social-media service implemented by multiple services/processes. Here the client’s view of the session represents the conversation between the client and the server, whereas the server’s view of the session also includes its interaction with the multi-tiered application processes.

3.4.2 Peer to Peer

Peer-to-peer communication is implied when processes are not confined to specific roles. The design of BitTorrent applications, for example, creates connections to hundreds of peers at the same time. Distributed-hash-table records, lists of available files, and content segments can be shared over independent flows in the same session, greatly simplifying the design of protocols between peers.

3.4.3 Publish Subscribe

SLIM benefits applications following the publish-subscribe pattern by providing direct support for multi-party sessions with dynamic reconfiguration. For example, clients can subscribe to and unsubscribe from a video streaming service yet be treated as a whole by the server through the session abstraction. While responsibility for changes in video quality in response to varying network conditions belongs higher in the stack (in the “presentation layer”), feedback from the flows to the application is required for the quality to be properly adapted. Such feedback, while not yet implemented, is within the purview of SLIM.

(Note: similar to the client-server pattern, the publish-subscribe pattern naturally gives rise to a hierarchical communication model where endpoints appear to be singletons from one perspective but are in fact complex multi-process service implementations from another perspective.)

3.4.4 Broadcast

In the broadcast model, multiple participants engaged in the same session receive all data from each other. SLIM provides the capability to apply primitives to individual or groups of flows, giving application developers greater flexibility while eliminating the tedium of bookkeeping for each flow individually.

3.4.5 Survey

The survey model has similarities with the publish-subscribe model. Here one process makes a request, which is answered by multiple participants. The surveyor process may register a session which participants may join to respond to the requests. The session abstraction would describe the views of all the participants. Consider, for example, a collector in the finance application mentioned earlier, polling data from the stock exchange. The session abstractions would represent the view of communication for the collector engaging the data sources over multiple, independent flows, as well as the view of the individual sources, each with a single flow to the collector.

3.4.6 Pipeline

The pipeline model of communication is a limited version of the peer-to-peer model, where a process can only assume one role — either producer or consumer of information. Since the session abstraction successfully describes the peer-to-peer model, describing the pipeline model is the same, except for the limitation of unidirectional communication.

An Example with Multiple Participants: Imagine a session where a video streaming service broadcasts a soccer match with video streams to multiple devices and match statistics to selected devices, while serving the same user; this is a publish-subscribe communication pattern.

A chat application is also available with a peer-to-peer communication pattern. Note that the session abstraction does not confine its use to a particular model; the session abstraction simultaneously exhibits behavior of several different models — i.e., peer-to-peer as well as publish-and-subscribe models. Here we have multiple participants interacting as part of the same session.

3.5 Prototype Implementation

Here we discuss a prototype of the session-layer intermediary, SLIM, implemented as a user-space library in C. The application interface, session-layer primitives, control signaling, session registry, and a shim layer for backwards compatibility with legacy applications are implemented in 3189 lines of source code (without comments).

3.5.1 Session State

In the prototype, a view of the session is maintained for each participating process. This session state includes details of participating endpoints, the flows that exist between them, the configuration and structure of the flows, the flow-to-transport mappings, the available network interfaces on the host where the endpoint resides, and the session type. In addition, the identities of the session, flows, and endpoints and the mappings of those identities — e.g., endpoint label to location mapping (IP and port addresses) — are maintained as part of the session state.

During the lifetime of communication, the session state changes. The changes that are local to the endpoint need not be shared with the other participants (e.g., the sequence-space mappings of the flows to the underlying transport). However, changes in state that are relevant to the entire session may need to be shared with participants to maintain consistency (e.g., endpoint label to location mappings). These updates are shared through the control flow (using the sync verb) and are initiated by the source of the change.

3.5.2 Data Flows

Data flows are added with the add_flow primitive. Each data flow is identified internally by a unique flow label and mapped to an appropriate transport connection (stream or message) depending on the flow type. The flow label exposed to the application is mapped to the file descriptor returned by the standard libc socket API.

3.5.3 Flow Labels and Greater Functionality

For legacy application behavior, the flow labels merely serve as references to the session-level concept of a flow. However, they play a significant role in supporting extensions to the stack, for example, in the case of mobility. After the host servicing an endpoint moves to a different subnet, the network address assigned to the interface may change. This invalidates the transport connection that the flow was mapped onto (for simplicity, consider the example of two participants involved in the session). However, SLIM recognizes the change and instantiates a new transport connection onto which the flow is mapped. The participants are able to recognize that this is the same flow, since the sync verb initiated by the source triggers an update of mappings between the flow and the new transport connection at the recipient. Note that this allows the communications to work in spite of the presence of middleboxes (e.g., NATs that may change the IP addresses associated with traffic flowing through them).

3.5.4 Structure of Flows

To realize a one-to-one structure for flows between endpoints, we have a correspondence between a flow and a transport connection. On the other hand, for a broadcast structure, we implement all-to-all connectivity of transport connections between the endpoints, to which the flows are mapped. Note that the focus on broadcast and one-to-one flows is not fundamental and that the entire spectrum of configurations can be supported. For example, for peer-to-peer communications, a broadcast structure might not be feasible and a Chord-like configuration [98] may be appropriate. However, we leave that exploration for future work.

3.5.5 Flow-to-Transport Mappings

The flow-to-transport mapping requires a mapping of sequence spaces between the two. The session-layer implementation delegates the responsibility of managing transport semantics (e.g., retransmission of lost bytes) to the underlying transport implementation. Doing so allows the bookkeeping of delivered and undelivered bytes in the context of flows to be relatively straightforward. We leverage our experience from our previous work [72,74] in creating this mapping of sequence spaces. If a transport connection ends prematurely, a new connection is created and the data flow is mapped onto it. Without duplication of effort, we are able to use the mapping of sequence spaces to ensure that no data is lost during reliable transfers. This indirection allows the session layer to provide resilience and hide the messy details of adaptation from the application. Comprehensive details of the sequence-space mapping are documented in our prior work [72,74].

3.5.6 Control Flows

A control flow is created when an endpoint joins the session; the intention here is to have a control flow between endpoints participating in the session to allow the exchange of control signaling. Note that in our implementation, a control flow is the same as a data flow, with the exception that the abstraction is not directly exposed to the application. (The management of control flows and the execution of verbs are implemented in independent threads; this avoids interference with the application’s execution thread.)

Control flows are used to exchange verbs, which are commands that trigger corresponding handlers. For example, the receipt of the verb suspend_flow triggers the launch of the corresponding handler v_suspend_flow(). After ensuring that the request is valid and authorized, the parameters are processed to effectively pause further writes and reads to and from the flow — e.g., this may be required when migrating between subnets, for which new transport connections need to be established to resume connectivity. Note that extending SLIM to include additional verbs translates into registering the verb and the corresponding verb issuer and handler with the library. For example, the verb sync has its corresponding issuer i_sync and handler v_sync registered with the library.

Typical verbs are implemented as non-blocking requests. The recipient acknowledges the re- ceipt with (success or failure) codes. However, this does not preclude implementation of verbs as blocking requests. Blocking requests are prudent where transactions may need to be rolled back if unsuccessful. This behavior is implemented as part of the verb issuer.

Context Manager: A context-management thread runs in the background with a low profile. The intent here is to allow the session layer to be cognizant of the circumstances in which communication is taking place.

In the prototype implementation, the context manager creates a netlink socket and listens for network-interface events through RTMGRP_LINK. If the interface is disconnected (i.e., the link goes down) and an alternate interface is available, the context manager triggers the setup of a new transport connection over the alternate interface and issues a sync between the participants to resume connectivity. Alternate solutions that listen for OS events may be used instead to solve this problem; we plan to evaluate such methods with SLIM’s implementation as a kernel module. We implemented this feature as an example of other possibilities that range from a variety of dynamic configurations to applications of policy enforcement (e.g., assisting captive portals).

3.5.7 Session Labels and Registry

The session maintains a session identifier and a human-readable label, which may be published in a session registry. We implement a rudimentary LDAP service to act as a session registry and anticipate that a production deployment would involve a DNS [95]-like hierarchical LDAP [96] service. Upon registration, the members’ endpoint details are listed along with the label. The register, push_update, revoke, and translate primitives enable interaction of the session layer with the registry and ensure that it is kept up to date. Further details are available in the technical report [99].

3.5.8 Support for Legacy Applications

Since the prototype implementation of SLIM is available as a user-space library, applications using the API simply link to the library for network access. To support legacy applications, which use the Socket API, we have implemented a shim layer, which intercepts Socket API calls using LD_PRELOAD and then maps these to the SLIM API. This is illustrated in Figure 3.10. Doing so allows SLIM to be backwards compatible with the legacy applications.

Figure 3.10: SLIM in relation to legacy applications and those using the library. A legacy application reaches SLIM through an LD_PRELOAD wrapper over the Socket API, while an application using the SLIM API directly gains access to SLIM’s greater functionality.

Understandably, with the use of the shim layer, legacy applications will not be able to make use of all the features supported by SLIM, other than those that we implement as part of the shim layer. On the other hand, applications that are programmed using the SLIM API are able to benefit from the greater functionality that SLIM enables. Note that the illustrations in Figures 3.7 and 3.10 highlight that SLIM serves as a wrapper around the Socket API.

3.5.9 Support for Mobility, Migration, and Resilient Communications

Leveraging our experience from our prior work [74], we highlight how the separation of session and transport semantics can enable migration of services between networks without disrupting communications. We demonstrated a live migration of a virtual machine (hosting an SSH server) from Blacksburg, Virginia, on the East Coast to Sunnyvale, California, on the West Coast.

In spite of the migration to a different network, the client application remained connected to the service and continued to operate successfully [100]. This was only possible because the session-layer flow was able to update its mapping from one transport connection (to the service hosted in the Blacksburg network) to another after the migration (in the Sunnyvale network) as illustrated in Figure 3.4. The indirection introduced due to separation of session and transport semantics allowed SLIM to enable resilient communications.

We also highlight the ability of SLIM to provide resilience by leveraging our experience from prior work [72,73], where we interrupted communications by physically disconnecting the network between a legacy client application (VideoLAN) and a media streaming server. The physical disconnection resulted in the network interface going down. Upon reconnection, the context manager recognizes the change in interface status and triggers a synchronize event.

Following this, the media continues to stream to the client application without losing the session. This would not have been possible with legacy TCP, for which the socket would become invalid after the disconnection.

We continue this discussion further in Chapter 4.

3.6 Discussion

With explicit session management along with the built-in control flow, SLIM provides support for several innovative uses of the network not well served by legacy TCP, including resilience, migration, and stack extension. It also has the potential for simplifying and refining existing communications. In this context, we discuss and evaluate SLIM’s contributions.

3.6.1 Separation of Session and Transport Semantics via Session-Based Abstractions

In the previous sections, particularly in § 4.3 and 3.2, we made a case in favor of separating session and transport semantics as the means to rejuvenate innovation in network stacks. We explained that legacy stack implementations have conflated session and transport semantics within the Socket API and why this makes future innovation a challenge. For example, from the application developer’s perspective, when using legacy implementations the conflation of semantics results in the transport connection appearing as if it were the entire session; the beginning and end of the connection is the beginning and end of the session, when in fact the session and flow semantics are fundamentally independent of how the transport transfers data across the network. Thus, a feasible way forward is to decouple these semantics and subsequently avoid having the limitations of underlying layers permeate through to the applications; unless the session and transport semantics are decoupled, developers will be forced to implement modern applications with limited abstractions, they will continue to be constrained by the limitations of the underlying implementations, and they will continue to duplicate efforts in implementing session management to support their use cases [4,5].

With SLIM we are able to decouple session management from transport semantics and, subsequently, from the limitations of the underlying implementations. As a result, the SLIM API only exposes session semantics to the developer and thereby simplifies the design of network applications and services. For example, unlike the transport implementations that couple transport labels to network labels, the session abstractions insulate themselves from the limitations of such cross-layer coupling and are therefore not limited by naming constraints. As a result, developers are relieved of the concerns of maintaining resilient communications and can instead focus on utilizing the provided session semantics. We demonstrate the efficacy of this decoupling with our prototype implementation (§ 4.5.3) and through the virtual machine migration demonstration [74, 100], where we highlight support for mobility and resilient communications. This separation of session and transport semantics also helps in minimizing duplication of effort, as we see developers tackling the same challenges over and over again [4, 5], which are in part addressed through SLIM.

3.6.2 Enabling Greater Functionality

All communication models [97] benefit from the separation of session and transport semantics. Explicit session management opens avenues for innovation and extensibility. For example, explicit session management enables: sessions with more than two participants, session migration between hosts (in contrast to host mobility), adaptive configuration of communications, multihoming — explicitly binding flow-to-transport mappings to use different network interfaces (when available), transformation of flows, and a variety of flow-to-transport mappings.

Adding Value to Communication: Consider the example of a browser accessing a web page. The browser typically creates multiple connections to a server to obtain content for constructing the page. In this client-server model, once a session has been established, the transport connection-setup cost for subsequent connections can be avoided because transport parameters between the same two hosts can be reused or derived rather than recreated — e.g., the three-way handshakes become redundant for subsequent connections because the session has already incurred the cost of bootstrapping, and thus data can be sent along with the first TCP segment of subsequent connections. This is especially beneficial when flows within a session share authentication and encryption parameters.
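The savings above can be made concrete with a back-of-envelope cost model. This is an illustrative sketch, not measured data: the function names and the figure of six browser connections are assumptions, and the model charges exactly one RTT per legacy handshake.

```python
# Illustrative cost model: with legacy TCP, each of n connections pays a
# handshake of roughly one RTT before data can flow; within a session, only
# the initial bootstrap pays that cost and later flows can carry data with
# their first segment.
def legacy_setup_cost_ms(n_connections, rtt_ms):
    return n_connections * rtt_ms

def session_setup_cost_ms(n_flows, rtt_ms):
    # n_flows is irrelevant here: subsequent flows skip the handshake.
    return rtt_ms

rtt = 50  # ms, one of the RTTs from the document's test matrix
legacy = legacy_setup_cost_ms(6, rtt)    # e.g., six browser connections
slim = session_setup_cost_ms(6, rtt)     # one bootstrap for the session
```

Under these assumptions, six connections cost 300 ms of handshakes with legacy TCP but only 50 ms within a session; the gap grows linearly with the number of flows.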

Enabling Richer Functionality: SLIM benefits applications using the publish-subscribe model by supporting sessions involving two or more participants with dynamic reconfiguration. For example, clients can subscribe and unsubscribe to a video streaming service yet be treated as a whole by the server through the session abstraction. While responsibility for changes in video quality in response to varying network conditions belongs higher in the stack (in the "presentation layer"), feedback from the flows to the application is required for the quality to be properly adapted.

Another potential benefit, which we do not support yet, is for peer-to-peer communication where processes are not confined to specific roles — e.g., the BitTorrent protocol simultaneously creates connections to multiple peers, where distributed hash table records, available files, and content segments can be shared over independent flows in the same session. In such a case, flows may be structured between endpoints in a manner suitable to the algorithms (e.g., in a Chord-like fashion [98]). This simplifies the design of protocols between peers.

3.6.3 Enabling Innovation and Extensibility

Through the control flow, SLIM presents a large control-signaling space for future extensions. This is in sharp contrast to the shrinking option space available through TCP options — i.e., the space in the TCP SYN message [67, 72]. Having a signaling space to exchange control information between stacks is necessary to enable future extensions. Developers can use the control space to integrate extensions that operate within session semantics by defining verbs and registering corresponding handlers with the SLIM framework, for applications and services to use.
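The verb-and-handler mechanism can be sketched as a dispatch table. This is a hypothetical model of the idea, not SLIM's API: the names ControlFlow, register_verb, dispatch, and the MIGRATE verb are all illustrative.

```python
# Hypothetical sketch of extension via verbs: an extension defines a verb and
# registers a handler; control-flow messages carrying that verb are dispatched
# to the handler, and unknown verbs are refused.
class ControlFlow:
    def __init__(self):
        self.handlers = {}            # verb -> callable(payload)

    def register_verb(self, verb, handler):
        self.handlers[verb] = handler

    def dispatch(self, verb, payload):
        if verb not in self.handlers:
            return ("ERR", "unknown verb")
        return self.handlers[verb](payload)

ctrl = ControlFlow()
# An extension registers a (hypothetical) migration verb.
ctrl.register_verb("MIGRATE", lambda p: ("OK", p["new_addr"]))
status, addr = ctrl.dispatch("MIGRATE", {"new_addr": "203.0.113.9"})
```

Because the table is open-ended, the signaling space grows with registered verbs rather than being bounded by a fixed option field.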

For example, clients connecting to a secure web server may connect through an SSL accelerator, which terminates the secure connection and proxies the now-unencrypted traffic to the servers that make up the service. Because all traffic must pass through the accelerator, it has the potential to become a bottleneck. An alternative approach enabled by SLIM is to set up a session between the client and all the servers implementing the service, including the SSL accelerator. Once authenticated, the session is secure and the participants can switch to using less computationally expensive symmetric ciphers, allowing flows to go directly between endpoints without passing through the accelerator. Alternatively, the accelerator can migrate the flow to the servers. Here, all the control signaling necessary to manage the session activities can be implemented through SLIM's control flow.

3.6.4 Backward Compatibility and Adoption

As we discuss in § 4.5.3, backward compatibility is critical for adoption of the proposals by the wider community. With SLIM, we successfully achieve backward compatibility with both network stacks and legacy applications.

SLIM uses custom TCP options as part of the TCP SYN message to determine whether the peer stacks support SLIM. If they do, then enriched communications are set up as planned. However, if peer stacks do not support SLIM, the implementation gracefully falls back to legacy TCP.
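The negotiation logic amounts to an offer-and-echo check. The sketch below is a simplified model under stated assumptions: the option kind 253 is a placeholder (it is one of the experimental TCP option kinds), and real negotiation happens inside the stack, not in application code.

```python
# Simplified model of SLIM's capability negotiation: the initiator places a
# custom option in the SYN; if the peer's SYN-ACK echoes it, a control flow
# is set up, otherwise communication proceeds as legacy TCP.
SLIM_OPTION_KIND = 253  # placeholder: an experimental TCP option kind

def negotiate(syn_option_kinds, synack_option_kinds):
    offered = SLIM_OPTION_KIND in syn_option_kinds
    echoed = SLIM_OPTION_KIND in synack_option_kinds
    return "slim" if (offered and echoed) else "legacy"

# Peer supports SLIM: the option is echoed back.
mode_enriched = negotiate({SLIM_OPTION_KIND, 2}, {SLIM_OPTION_KIND})
# Peer (or a middlebox) drops the option: graceful fallback.
mode_fallback = negotiate({SLIM_OPTION_KIND}, set())
```

Note that the fallback path is indistinguishable from an ordinary TCP handshake, which is what permits incremental deployment.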

Further details of the use of custom options are discussed in Chapter 4, the technical report [99], and our prior work [72].

We provide a shim library along with SLIM which intercepts legacy Socket API calls from applications using LD_PRELOAD and passes them on to the SLIM API. This is for when the intent is to use the SLIM API indirectly. If the application developer does not intend to use the SLIM API, the application may be programmed traditionally against the Socket API; without LD_PRELOAD, SLIM will not interfere with the communications.

By enabling backward compatibility and allowing applications as well as network stacks to gracefully fall back to legacy TCP, SLIM supports incremental adoption by not forcing everyone to opt into using SLIM to set up communications.

3.6.5 In the Presence of Middleboxes

Research indicates that traffic with custom transport headers is either dropped altogether or has its options stripped off when packets pass through middleboxes [7, 67, 72, 74, 98]. This poses a significant challenge to adoption, since our network infrastructures today host many middleboxes serving useful purposes. Thus, a proposal for the evolution of communications not only has to be backward compatible with legacy applications and network stacks, but also with the network elements that form part of the infrastructure. Also, in some cases, middleboxes modify the traffic flowing through them to provide their services. For example, NATs change the IP addresses of outgoing traffic, and therefore the recipients cannot assume that the source IP address is that of the original client (or that of a NAT box). Therefore, careful consideration must be given to such assumptions.

With SLIM, because we decouple session semantics from those of transport, communications are independent of changes in the underlying configuration or modifications to in-flight traffic. The only scenarios where the existence of a middlebox could possibly interfere with the SLIM session layer are during: 1) creation of a data flow or 2) creation of the control flow. Setting up a data flow maps to a TCP connection via the add_flow primitive, so middleboxes will react the same whether the TCP connection is initiated directly or via SLIM. Since control flows are also mapped to TCP connections, middleboxes are not able to tell the difference either.

While NATs, accelerators, and load balancers do not interfere with SLIM, firewalls may cause disruptions since SLIM uses custom TCP options along with the TCP SYN message to set up communications. Surveys [67] show that firewalls typically let custom options through if they are associated with the TCP SYN message (and not otherwise), which we have also confirmed with testing [74]. Nevertheless, if the control flow setup fails, SLIM gracefully falls back to legacy behavior. The key enabler is that SLIM and legacy traffic are identical on the wire.

3.6.6 Applying Lessons Learned to Non-TCP Transport

Though SLIM focuses on improvements to the TCP/IP sockets protocol for resilience and connection management, the ideas presented in this thesis are applicable beyond that too. Specifically, most network protocol stacks only provide options for connecting specific network and transport addresses. For instance, with InfiniBand (IB) verbs, the equivalent to the IP address would be the IB local identifier (LID), and the equivalent to the TCP port would be the IB Queue Pair identifier [101]. Despite the different naming convention, the concepts are still the same: these identifiers bind the communication to a specific communication path and endpoint. SLIM provides a higher-level abstraction and a runtime system that can manage specific communication channels internally, which is a model that is just as applicable to other network protocols as well.

3.6.7 Development Effort — Cost vs. Value

With SLIM managing the conversation (the session management necessary to support dynamic reconfiguration, communication involving two or more participants, mobility, session-abstraction-to-transport mappings, etc.), developers are free to focus on their application rather than implementing the underlying plumbing to enable application features.

Using the additional functionality that SLIM enables does require some modifications to applications. However, a subset of the SLIM session API is designed to mimic traditional socket semantics, allowing existing code to run on top of a SLIM stack without change by way of shared library interposition. We demonstrate this by developing a shim library that uses LD_PRELOAD to intercept Socket API calls and implements a wrapper for SLIM. While the application will not gain all the benefits, e.g., communications involving more than two participants or mobility, it will automatically gain the increased resilience that comes from being able to restart flows in a session.
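The resilience gained through the shim boils down to retrying a flow's underlying transport while the session, and hence the application's handle, survives. The sketch below is illustrative: the function names and the flaky transport are hypothetical, and real restarts would reuse the established session context rather than a bare retry loop.

```python
# Sketch: restart a flow's transport on failure while the session survives.
# A ConnectionError stands in for a broken underlying transport.
def send_with_restart(flow_send, data, max_restarts=3):
    for _attempt in range(max_restarts + 1):
        try:
            return flow_send(data)
        except ConnectionError:
            continue                    # restart the flow, keep the session
    raise ConnectionError("flow could not be restarted")

calls = {"n": 0}

def flaky_send(data):
    # First two attempts hit a dead transport; the third succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError
    return len(data)

sent = send_with_restart(flaky_send, b"frame")
```

With legacy sockets the first ConnectionError would invalidate the descriptor; here the retry is invisible to the caller because the flow identity outlives the transport.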

Automated Code Refactoring: We have implemented a prototype source-to-source translator that takes code using the Socket API as input and translates it into code that uses the SLIM API. We have successfully tested the prototype for correctness on simple client and server applications.

While this is work in progress, we have concluded that such transformations are possible and would facilitate evolution towards SLIM. Note that a simple refactoring enables support for fault-tolerant communication [72, 74].

3.6.8 User Space vs. Kernel

The transport layer is implemented within the kernel for various reasons, including cross-layer communication and performance. Our prototype is primarily in user space, with the modifications for custom TCP options as the only portion implemented as part of the kernel. We envision a production version to be implemented in the kernel, with its functionality exposed to applications through an interface. Functionality which does not lie on the critical path may continue to be implemented in user space. However, we have not explored other kernel-implementation concerns yet.

3.6.9 Security Considerations

As explained in § 3.2, we intend to lay a foundation that enables well-established information-security methods to assist session management. We argue that if efficient methods of access control (i.e., identification, authentication, authorization), confidentiality, integrity, or other information-security concerns are to be realized, then incorporating such concerns in the design is important (§ 3.2). On the other hand, opaque identifiers may be used for endpoints, flows, and sessions to develop simpler session-management solutions. While doing so would not preclude higher-layer information-security solutions, they would not benefit from the gains of an integrated session layer. Even when simpler/opaque identifiers are used, SLIM extensions do not expose network communications to additional security threats when compared to legacy TCP.

While the use of public keys as the basis for endpoint labels might suggest that anonymity may become a challenge, this is not the case. In fact, significant efforts are already underway to address such aspects [102]. Also, since the public-key infrastructure is already well established and deployed worldwide, its use does not require deployment of additional resources in the network.

3.6.10 Performance Evaluation

The overarching question that we try to answer here is: does the use of session abstractions incur a negative impact when compared to the use of the Socket API? We find that there is no statistically significant difference between the use of SLIM and the Socket API. The reasons behind this are two-fold: 1) SLIM is engaged in the configuration of communications during the setup, reconfiguration, and tear-down phases and is not actively involved during data transfer; 2) the flow abstractions, which are primarily used during data transfer, are implemented using the Socket API to manage transport semantics in the prototype.

We set up the environment using dummynet [103]. Dummynet is configured to obtain precise measurements; e.g., we set the kernel timer frequency to 4000 Hz, since Dummynet's emulation is otherwise coarse-grained and bursty at microsecond-level precision, which becomes apparent at low latencies. We define maximum window sizes and buffer sizes to ensure that they do not gate throughput tests. We also enable window scaling and selective acknowledgments.

We define link capacities of 1 Gbps between nodes using dummynet pipes. We vary round-trip times (RTTs) and packet loss rates based on typical values [104]: < 1 ms, 5 ms, 10 ms, 25 ms, 50 ms, and 100 ms for RTTs, and 0%, 1%, and 10% for packet loss rates. Here we show select results due to limited space.

Throughput: After a session is set up, the flow abstractions are mapped onto the underlying transport. Therefore, we would expect both the Socket API and SLIM to achieve similar throughputs. Our measurements confirm that results for different RTTs are statistically the same for all variations of configuration that we tested. Figure 3.11 shows the results for one configuration where the endpoints are able to saturate the link to the achievable throughput of about 94% of link capacity. With increasing RTTs, achievable throughputs are met after relatively longer-running tests. Since both SLIM and the Socket API use TCP as the transport protocol, we verified the achieved throughputs against Mathis' estimates [105]. With packet loss, we see that throughputs for both the Socket implementation and SLIM are adversely affected with increasing loss rate (see Figure 3.12).
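The kind of cross-check used above can be sketched directly: the Mathis et al. model estimates steady-state TCP throughput as (MSS/RTT) * C/sqrt(p). This is a generic illustration, not our measured data; the constant C ≈ 1.22 and the parameter values below are conventional, assumed figures.

```python
# Mathis et al. steady-state TCP throughput estimate:
#   BW ~= (MSS / RTT) * C / sqrt(p), with C ~= 1.22 for Reno-style TCP.
import math

def mathis_throughput_mbps(mss_bytes, rtt_s, loss_rate, c=1.22):
    bw_bytes_per_s = (mss_bytes / rtt_s) * (c / math.sqrt(loss_rate))
    return bw_bytes_per_s * 8 / 1e6      # convert bytes/s to Mbps

# Illustrative parameters: 1460-byte MSS, 10 ms RTT, 1% loss.
est = mathis_throughput_mbps(1460, 0.010, 0.01)
```

The inverse-square-root dependence on loss is what makes even 1% loss so damaging at these RTTs, consistent with the drop seen between the 0% and 1% configurations.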


Figure 3.11: Average throughput, with 90% confidence interval, for Socket and SLIM (1 Gbps link capacity, 0% loss).


Figure 3.12: Average throughput for Sockets and SLIM (1 Gbps link, < 1 ms RTT, 0% and 1% loss).

Figure 3.13 summarizes the throughputs from the flow's perspective with a broadcast structure. Figure 3.14, on the other hand, summarizes the throughputs of the underlying transport connections for the same flow. While it may seem that an increasing number of participants reduces throughput, this is not the case. When there are three participants in a session, from the sender's perspective there are two underlying transport connections that implement the flow with a broadcast structure. Therefore, in this case, when the flow observes a throughput of 468.5 Mbps to the endpoints, it is because the underlying transports observe throughputs of 449 and 488 Mbps to each process. These add up to about 937 Mbps, which is close to the achievable peak throughput. Note that the throughputs of the underlying transports increase until link capacity is reached.

Figure 3.13: Flow's perspective with SLIM and an increasing number of participants (1 Gbps link, varying RTTs, 0.01% loss).

Figure 3.14: TCP's perspective with SLIM and an increasing number of participants (1 Gbps link, varying RTTs, 0.01% loss).
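The bookkeeping above can be stated in code: a broadcast flow reports the mean of its transports per receiver, while the sender's aggregate is their sum. The figures are the ones from the three-participant measurement above.

```python
# Throughput accounting for a broadcast flow implemented over two transports
# (three-participant session, figures from the measurement above).
transports_mbps = [449, 488]

aggregate_mbps = sum(transports_mbps)                          # at the sender
per_receiver_mbps = aggregate_mbps / len(transports_mbps)      # reported by the flow
```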

Setup Time: Time to initialize a session is expected to be nearly equal to the time to allocate memory for the instance. We measured this to be 5 µs, which is the same as initializing a socket. There is no statistically significant difference between the time to create a session with a single data flow and a TCP connection, which is dictated by the RTT between the endpoints (shown in Table 3.1). Usually, sessions will include a control flow and at least one data flow. The setup time is not impacted by the creation of the control flow, as its setup is managed independently, in parallel. Once a session context has been established, subsequent data flows do not require three-way handshakes before sending data.

Table 3.1: Setup time between peers

                               RTT (ms)   x̄ (ms)   s (ms)
  TCP connection               < 1        0.16      0.06
                               50         50.35     0.56
  Session with no flows        < 1, 50    0.005     0.001
  Session with a data and      < 1        0.16      0.07
  control flow                 50         50.52     0.70

Exchange of Verbs and Reconfiguration: When an asynchronous request is issued, the receiving endpoint returns an acknowledgment with a code. We measure the time it takes for the request to be issued and the return code to be received. A subset of the results is presented in Table 3.2, where we issue consecutive requests to obtain a trace and also capture the variance. We see that the latency is dictated by the RTT between the endpoints. The variation measured in our traces is a characteristic of dummynet, due to the coarseness of its implementation when inducing delays through pipes. For blocking reconfigurations, the response time would depend on the type and context of the verb.

Table 3.2: Response time of non-blocking reconfigurations

  RTT (ms)   x̄ (ms)   s (ms)
  10         10.36     0.61
  50         50.77     0.65

Memory Footprint: With support for higher-layer abstractions, a session instance's memory footprint is larger than that of a socket abstraction, which essentially is a file descriptor. When two endpoints are involved in a session, the instance has a memory footprint of 132 bytes. The breakdown is shown in Table 3.3. The addition of every subsequent endpoint increases the footprint by 42 bytes, while the addition of a data flow increases the footprint by 18 bytes. The above profile is calculated with the assumption that the endpoint hosting the session has a single interface. Every additional interface on the endpoint increases the footprint by 22 bytes. While the footprint of 132 bytes seems large in contrast to a 4-byte socket, note that the developer would have to implement similar bookkeeping as part of the application for similar use cases.

Table 3.3: Session's memory footprint

  Component              Bytes
  Human-readable label   20
  Opaque ID              24
  One endpoint           20 (label) + 22 (sockaddr_in)
  Control flow           10 (label) + 4 (ID) + 4 (handler)
  One data flow          10 (label) + 4 (ID) + 4 (handler)
  Misc. bookkeeping      10
  Total                  132
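The accounting in Table 3.3 and the accompanying text can be folded into a small function. The function name is illustrative; the constants (132-byte base for a two-endpoint, one-data-flow session with single interfaces, +42 per extra endpoint, +18 per extra data flow, +22 per extra interface) come from the text above.

```python
# Footprint model from Table 3.3: base of 132 bytes for a session with two
# endpoints, one data flow, and a single interface; marginal costs from text.
def session_footprint_bytes(endpoints=2, data_flows=1, interfaces=1):
    base = 132
    return (base
            + (endpoints - 2) * 42     # each subsequent endpoint
            + (data_flows - 1) * 18    # each additional data flow
            + (interfaces - 1) * 22)   # each additional local interface
```

As a sanity check, the table's rows also sum to the base: 20 + 24 + 42 + 18 + 18 + 10 = 132.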

CPU Overhead: We studied the CPU usage of applications conducting small (e.g., 10 KB), medium (e.g., 1 MB), and large-volume (e.g., 10 GB) transfers. We did not observe any statistically significant difference between applications using the SLIM and the Socket API. This is in spite of the fact that SLIM implements the control flow and a context manager; the control flow is not overbearing, since it only plays a role during communications management and is not involved in the data transfer phase.

3.7 Applying Pipelining to TCP for Efficient Communication over Wide-Area Networks: A Case Study Exemplifying Benefits of Context Awareness

Here we present a case study that exemplifies how the ability to configure sophisticated constructs using SLIM, on top of the legacy network stack, can present significant gains.

3.7.1 Background

A collaboration between Virginia Tech and the University of New Mexico requires the transfer of streaming data from the Long Wavelength Array (LWA) in New Mexico to Blacksburg, Virginia. The LWA is capable of generating 4 TB of data per hour, and data rates will grow by 53X as the instrument rolls out. Conventional methods of optimizing network bandwidth usage, which focus on file transfers, do not apply per se to such streaming data.

To exacerbate the problem, TCP New Reno [106], the most widely deployed TCP congestion-control algorithm, delivers poor average throughput over paths with high bandwidth-delay products. This results in poor bandwidth utilization. The algorithm is designed such that the sender has to wait for the receiver's feedback before the window sizes may be updated, and therefore TCP New Reno cannot react fast enough over high-latency links to achieve higher throughput.

Substantial research has been done to improve throughput over high-latency and high-bandwidth links. This includes pre-staging data in temporary caches near users [107], optimizations such as parallel streams [108] and buffer tuning [109], performance-enhancing proxies [110], overlay routing [111], and state-of-the-art congestion-control algorithms [112–115].

Pre-staging data in temporary caches near users [107], though effective for file transfers, cannot be applied to streaming per se because the complete dataset does not exist a priori. While buffering may alleviate the problem for small data volumes, the challenge is exacerbated with Big Data streaming, where temporary caches are not large enough. For example, with the Long Wavelength Array, where data is generated at a rate of up to 4 TB per hour, it cannot be cached due to its sheer volume.²

Parallel streams [108] and buffer tuning [109] are mechanisms orthogonal to what we propose and can be used together to multiply the gains in improving average throughput.

State-of-the-art congestion-control algorithms [112–115] improve average throughput but are not widely deployed³ [116] and do not exhibit TCP friendliness and RTT fairness [117].


Figure 3.15: The long-haul (end-to-end) TCP connection is split into two independent TCP connections by the relay, each with smaller latencies.

There is a need to address the root cause of the problem, that is, to decrease the latency over which the congestion-control protocols operate. Though end-to-end latency is dictated by the speed of light in fiber, the latency over which each congestion-control loop operates can be reduced. We propose the use of TCP streaming relays to reduce the impact of large end-to-end latencies. We do so by providing guidance for the deployment of these relays so as to reduce the

2The LWA instruments can saturate 10 Gbps links today and data rates are expected to grow up to 53 X if instruments are fully deployed. 3 FreeBSD and OSX use TCP New Reno [106] as the default congestion control algorithm and Compound TCP [115] is disabled by default in Microsoft Windows Vista. However, Linux kernel 2.6.19 and onwards uses CUBIC [113] as the default algorithm CHAPTER 3. SESSION-BASED COMMUNICATION MODEL 87 impact of large end-to-end latencies. As illustrated in Figure 3.15, a relay takes responsibility for pipelining traffic from the source to the destination. The connection from the source is terminated at the relay; the relay in turn creates an independent (or cascaded) TCP connection to forward data towards the destination.Cascaded'TCP:' As shown in Figure 3.16, thisB arrangementIG'Throughput'for' enables BIG'DATA'Applications'in'Distributed'HPC' the congestion-control algorithm of the TCP streams to react faster — due to smaller latencies + + 0 0.01 0.1 1 Umar%Kalim*,%Mark%Gardner* ,%Eric%Brown ,%Wu4chun%Feng*% 0 0.01 0.1 1 100 — thereby achieving higher aggregate bandwidth. In other words, in a chain of relays, end-to- + 100 0 0.01 0.1 1

Department%of%Computer%Science*,%Office%of%IT ,%Virginia%Tech,%% 4 10050 synergy.cs.vt.edu% end latency is divided into segments such that each segment would have a latency much less 50 4

{umar,%mkg,%brownej,%wfeng}@vt.edu% 50 0 4 0 than the end-to-end latency. Consequently, a divide-and-conquer approach allows the sliding- 0 Motivation% Results% 100 window protocols operating on either side of each% relay to receive control feedback much faster % • Even%when%utilization%is%low%(at%high%bandwidth4delay%products),%100 10050 8 than• theVanilla%TCP%implementations%are%typically%unable%to%saturate%links%over%long-haul connection, thus delivering better bandwidth utilization. We refer to such a • Increased%throughput%% Cascaded%TCP%shows%100%%improvement%in%throughput%50 8 Setup: One Relay 50 0 8 high%latency%and%high%capacity%paths%!%poor%utilization%% observed%with%Cascaded%TCP%% 120 0 100 setup• asCause:%congestion4control%feedback%is%coupled%with%acknowledgements:%%Cascaded TCP 118 . Note that such an approach would benefit both streaming data 120 Setup:0 One Relay Loss (%) [ ] % 10080 0 100 • Larger%the%latency,%the%longer%it%takes%to%increase%the%window%size% • Cross4over%point%is%between%% 8060 0 0.01 0.1 1

0 100 as well as typical file transfer. 6040 16 10050 • Higher%the%capacity,%the%longer%it%takes%to%saturate%the%link% 16 a%bandwidth4delay%product%% 4020 50

20 0 16 % of%32KB%and%64KB% 50 0 4 Timeout% Packet%Loss% Retransmit:% 0 0 % Congestion% Congestion% % 120 Window% Window% Slow%Start% 0 Again% 120100 % • As%bandwidth4delay%% 80 0.01 100 ssthresh% ssthresh% 100 100 0.01 % products%increase,%utilization%% 8060 32 10050 Congestion% 6040 50 32 % Slow%Start:% is%accentuated%before%we%% 20 (Mbps) Bandwidth 40 32 Exponential% Avoidance:% 50 0 8 % see%diminishing%returns%% 20 0 (%) Packet Loss 0 Increase% Linear%Increase% Time% Time%

0 (%) Difference Percentage % % 120 0 Percentage Difference (%) Difference Percentage 100 Congestion%control%protocol%reacts%slowly%to% Protocol%reacts%faster%to%changing% 120100 % • Cascaded%TCP%ameliorates%% (%) Difference Percentage 100 network%conditions%over%high%latency%links% conditions%over%low%latency%links% 80 0.1 100 64 10050

60 64 % impact%of%losses;%as%losses%% 80 0.1 40 50 64 60 16 Figure• 3.16:Bulk%data%transfer%solutions%cannot%be%applied%as%such%to%streaming%apps% Congestion control feedback is coupled with TCP acknowledgments and thus latency be- increase,%the%cross4over%% 50 0 4020 0 tween peers has a significant influence on the control signaling. 0 • Streaming%BIG%DATA%application%operating%over%WAN:% point%–%where%Cascaded%% 20 0 Bandwidth Utilization (%) 0 TCP%performs%better%–%% 120 100 Bandwidth Utilization (%) Type 100 100 128 120 (%) PercentageDifference moves%towards%smaller%% 50 10080 Type 100 128 Reducing the adverse impact of large latencies on throughput by using relays is not a novel Cascaded 1 50

bandwidth%delay%products%% 8060 128 32

Cascaded 1 40 Long Haul 50 0 idea. However the use of relays is not appropriate for all scenarios. Thus, there is a need to 60 0 (i.e.,%64KB%to%32KB)% 4020 Long Haul National%LamdaRail% 0 % NRAO/VLA%(4%TB/hr)% % 20 0

determine when Cascaded TCP may be applied.

In this context, our contributions are:

• An analytical model for Cascaded TCP that provides guidance for when to use relays.

• An evaluation of the use of TCP relays to improve aggregate throughput and to test the hypothesis that the use of relays improves bandwidth utilization.

We validate our hypothesis via an empirical study.

3.7.2 Analytical Model and Cascaded TCP

Let Tcascaded be the time required to complete the transfer of data using a cascade of relays. We express this as

\[ T_{cascaded} = T_{overhead} + T_{transfer}, \tag{3.1} \]

where Toverhead is the overhead of using Cascaded TCP and Ttransfer is the time required to complete the transfer using relays. Note that Toverhead = Tsetup + Tproc, where Tsetup is the time required to set up a cascade of relays and associated TCP connections and Tproc is the processing overhead.

Given Tlh as the time required to transfer data using long-haul TCP, it would be prudent to choose Cascaded TCP if the overheads do not outweigh the benefits, i.e.,

\[ T_{cascaded} < T_{lh}, \tag{3.2} \]

or

\[ T_{overhead} < T_{lh} - T_{transfer}. \tag{3.3} \]

Transfer Time for Long-Haul TCP (Tlh)

We know that throughput is inversely proportional to latency between end-points. The rate of increase in throughput is also coupled with latency, as the longer the latency, the longer it takes the congestion-control algorithm to increase the sender's TCP window size and converge towards the ideal bandwidth. The same conclusions can be derived from Mathis' model [105]:

\[ \text{throughput} \le \frac{c \cdot MSS}{RTT \cdot \sqrt{p}}, \tag{3.4} \]

where MSS = maximum segment size, RTT = latency, p = loss probability, and c = \sqrt{3/(2b)} such that b = 1 for long-haul TCP and b = 2 when delayed ACKs are enabled.

If S is the size of data, then using (3.4), the average transfer time over long-haul TCP is expressed as:

\[ T_{lh} = \frac{S}{\text{throughput}} \ge \frac{S \cdot RTT \cdot \sqrt{p}}{c \cdot MSS}. \tag{3.5} \]
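Mathis' bound (3.4) and the long-haul transfer time (3.5) can be sketched in Python. This is an illustrative sketch: the function names and the example parameter values (1 GB transfer, 1460-byte MSS) are ours, not taken from the testbed.

```python
import math

def mathis_throughput(mss_bytes, rtt_s, loss, b=1):
    """Upper bound on TCP throughput in bytes/s, per Mathis' model (3.4)."""
    c = math.sqrt(3.0 / (2.0 * b))  # b = 1 (long-haul), b = 2 (delayed ACKs)
    return (c * mss_bytes) / (rtt_s * math.sqrt(loss))

def long_haul_transfer_time(size_bytes, mss_bytes, rtt_s, loss, b=1):
    """Average transfer time over long-haul TCP, from (3.5)."""
    return size_bytes / mathis_throughput(mss_bytes, rtt_s, loss, b)

# Example (hypothetical): 1 GB over a 128 ms path with 0.1% loss.
t = long_haul_transfer_time(1e9, 1460, 0.128, 0.001)
```

Note that halving the RTT in this model doubles the throughput ceiling, which is the property Cascaded TCP exploits.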

There are much more precise models than Mathis’ approximation such as that by Padhye et al. [119]. We chose Mathis’ approximation for its simplicity, though Padhye’s or other models may be used. However, the models cannot be applied to short-lived TCP flows as their entire lifetimes are usually within the slow-start phase.

Data Transfer Time for Cascaded TCP (Ttransfer)

Cascaded TCP may be classified as non-pipelined or pipelined, based on the mechanics of the relay [118]. The relay is non-pipelined when it stores all the traffic coming from the sender until the connection is closed. Once the sender’s connection closes, the relay starts forwarding traffic to the destination (or the next relay in the chain). It is understandable that such an approach would not be a viable option, particularly for streaming data. In contrast, a relay is pipelined when it is allowed to forward packets as soon as they are available in the queue.

To avoid packet drought in the buffer — when a relay does not have enough data to send — the relay’s outbound TCP connection may wait for W windows before starting transmission.

Therefore, the transfer time Ttransfer, or in particular the pipelined transfer time Tpc, can be expressed as:

\[ T_{transfer} = T_{pc} = \frac{S \cdot RTT_k \sqrt{p_k}}{c \cdot MSS} + \sum_{i=1}^{N} W \cdot RTT_i, \tag{3.6} \]

where for each i = 1, ..., N, RTTi denotes the round-trip time for TCP connection i. The first term is the bottleneck-link transfer time. From (3.4), the bottleneck link is \( k = \arg\max_{i \in \{1,\dots,N\}} RTT_i \sqrt{p_i} \) — that is, the link with the lowest available bandwidth. The second term is the buffering to prevent drought and comes from the first W end-to-end round-trip times that each connection waits before starting its transmission. As window scaling is enabled for sizes beyond 64 KB and it takes 16 round-trip times during the slow-start phase for the window to grow beyond 64 KB, we choose 16 as the default value for W in (3.6).

Note that both (3.5) and (3.6) depend on latency (RTT) and loss (p). Here the message size may be considered a constant when choosing between long-haul and cascaded TCP.

Setup Time for Cascaded TCP (Tsetup)

The time to setup relays has a strong correlation with latency between source and relays. This is because setup time involves sending configuration parameters to the relay to trigger setup.

Thus we can approximate setup time with the time it takes for the first payload segment to arrive. All relays along the path may be triggered in parallel. By doing so, the time to trigger relays that are closer to the sender is hidden by the time to trigger the relay that is the farthest.

For the sake of simplicity, we use the latency RTT between the sender and the receiver. A TCP connection is typically set up after one and a half round trips. The first payload may arrive along with the third segment in the 3-way handshake. Therefore:

\[ T_{setup} \approx 1.5 \cdot RTT. \tag{3.7} \]

If this optimization were not applied and relays were set up in sequence, the overall setup time would be the sum of setup times for all the relays, that is, \( T_{setup} \approx \sum_{i=1}^{N} 1.5 \cdot RTT_i \), where N is the number of relays.

Processing Overhead (Tproc)

The processing overhead may be approximated by the time the relay process waits to avoid buffer drought, which is accounted for in (3.6). Note that the overhead of data passing through the relay’s transport layer instead of being forwarded at layer 3 by a router would be negligible when compared to the waiting time to avoid buffer drought.

Summary

We can estimate Tlh, Ttransfer, Tsetup and Tproc from (3.5), (3.6) and (3.7), which allows us to estimate Toverhead and evaluate the condition (3.3). We present and discuss the throughput estimates in relation to empirical measurements in Section 3.7.5.
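Putting (3.5), (3.6), and (3.7) together, condition (3.3) can be evaluated programmatically. This is a sketch under stated assumptions: Tproc is folded into the W-window wait of (3.6) as described above, and the function names and link parameters are hypothetical.

```python
import math

def mathis_throughput(mss, rtt, p, b=1):
    # Mathis' upper bound (3.4); c = sqrt(3 / (2b)).
    return math.sqrt(3.0 / (2.0 * b)) * mss / (rtt * math.sqrt(p))

def cascaded_is_viable(size, mss, rtt_e2e, p_e2e, hops, W=16):
    """Evaluate condition (3.3): T_setup + T_proc < T_lh - T_transfer.

    `hops` is a list of (rtt_i, p_i) pairs for the cascaded connections.
    T_proc is approximated by the W-window wait already counted in (3.6).
    """
    t_lh = size / mathis_throughput(mss, rtt_e2e, p_e2e)           # (3.5)
    # Bottleneck link: lowest Mathis bandwidth, i.e. largest rtt_i * sqrt(p_i).
    rtt_k, p_k = max(hops, key=lambda h: h[0] * math.sqrt(h[1]))
    t_transfer = (size / mathis_throughput(mss, rtt_k, p_k)
                  + sum(W * rtt for rtt, _ in hops))                # (3.6)
    t_setup = 1.5 * rtt_e2e                                         # (3.7), parallel trigger
    return t_setup < t_lh - t_transfer

# One relay halfway along a 256 ms, 0.1%-loss path (hypothetical values).
viable = cascaded_is_viable(1e9, 1460, 0.256, 0.001,
                            hops=[(0.128, 0.001), (0.128, 0.001)])
```

For large bandwidth-delay products the condition holds comfortably, while for very small transfers the W-window buffering wait dominates and the check fails, mirroring the discussion in Section 3.7.5.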

3.7.3 Experimental Setup

We use iperf v2.0.5 [120] to emulate the sender (client) and receiver (server) in our testbed.

The nodes run FreeBSD 9.0. The relays are implemented with netcat [121] as layer-4 relays.

Approach

We configure the operating system appropriately (e.g., set the kernel frequency timer at 4000 Hz, define maximum window sizes such that they do not gate bandwidth, and enable window scaling and selective acknowledgements). We compute the expected bandwidth-delay product for the end-to-end path. This allows us to determine the transfer size for the combination of bandwidth capacity, end-to-end latency, and packet loss, to ensure that TCP connections remain in steady state for at least 90% of their lifetime. We then configure Dummynet pipes at the sender and relay(s) to emulate available bandwidth/capacity, end-to-end latency, and packet loss. We vary configurations for Dummynet pipes in the ranges listed in Table 3.4. We measure achievable bandwidth, latency, and packet loss and take between 10 and 30 samples for each permutation of the parameters to compute statistical significance.

Table 3.4: Values used to configure Dummynet and emulate testbed.

Metric                  Range of Values
Round Trip Time (ms)    8, 16, 32, 64, 128, 256, 512
Bandwidth (Mbps)        0.512, 1, 2, 4, 8, 16, 32, 64, 128, 256
Packet Loss (%)         0.001, 0.01, 0.1, 1
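The transfer-size selection described above can be sketched as follows. This is a minimal sketch under our own assumptions: we approximate slow start as one window doubling per RTT and size the transfer so that slow start occupies at most 10% of the test; the exact rule used in the testbed is not specified here.

```python
import math

def transfer_size_bytes(bw_bps, rtt_s, mss=1460, steady_frac=0.90):
    """Pick a transfer size so the connection spends at least `steady_frac`
    of its lifetime in steady state (illustrative sketch)."""
    bdp = bw_bps / 8.0 * rtt_s                    # bandwidth-delay product, bytes
    # Slow start roughly doubles the window each RTT until it reaches the BDP.
    slow_start_rtts = math.ceil(math.log2(max(bdp / mss, 1.0)))
    min_duration = slow_start_rtts * rtt_s / (1.0 - steady_frac)
    return (bw_bps / 8.0) * min_duration          # bytes sent at the full rate

size = transfer_size_bytes(32e6, 0.064)  # 32 Mbps link, 64 ms RTT
```

As the text notes, this yields small transfers (seconds-long tests) for small bandwidth-delay products and much larger transfers as latency or capacity grows.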

Long-Haul (LH) TCP The long-haul TCP connections provide a performance baseline for our testbed. To measure baseline performance, we route through the same path by enabling layer-3 forwarding at the relays. This is done to make the emulation overhead the same for both long-haul and Cascaded TCP. For long-haul tests, the sender is configured to have the first relay node as its gateway. The client addresses the server as the receiver/destination. Here the bandwidth estimates are computed as follows: throughput = transfer size / test duration.

Pipelined Cascaded TCP Netcat is setup at the relays to act as layer-4 forwarding gateways.

As data arrives at the relay, it is pipelined/forwarded to the next relay in the chain or to the destination if it is the last relay. In this case, we cannot assume that the test is complete when the client's TCP connection terminates; the relay may still be forwarding data when the sender's connection terminates. Therefore our measurements include the time until the last relay in the chain is done forwarding traffic to the server.

We recognize that Dummynet's emulation of bandwidth and latency is coarse-grained; Dummynet queues packets, which can increase delay in addition to what is configured. This coarse-grained emulation and bursty behavior is apparent in low-latency emulations, where measured bandwidth results in observations greater than the limits defined.

3.7.4 Results

We collect and analyze results for all permutations of the metrics defined in Table 3.4 using one and two relays. Note: we take typical loss values from the PingER project that measures inter-regional Internet performance [104].

Observations from the Testbed

Bandwidth utilization decreases with increase in latency As summarized in Figure 3.17, we see that bandwidth utilization decreases with increased latency. We see similar trends for long-haul and Cascaded TCP. Utilization decreases further as loss increases. These observations remain valid for all varieties of bandwidth capacities we tested. We can derive the same conclusions from Mathis' throughput approximation.

For low-bandwidth limits and latencies, we observe that the bandwidth utilization is at times greater than 100%, which is not possible in reality. We attribute this to the coarse-grained emulation by Dummynet. This behavior is most apparent for low-latency configurations. We compute the transfer size based on the bandwidth-delay product such that the TCP connection remains in streaming state for about 90% of the time; for small bandwidth-delay products, the transfer sizes are small and tests complete in a few seconds. Variations in emulation at such granularities have significant impact on the results. In contrast, we do not see the same behavior for tests with large bandwidth-delay products, as the errors are amortized over the duration of the test. We could have increased transfer sizes to amortize the effect of these errors. While doing so for short latencies amortized the errors, for large latencies the test durations became unreasonably long. An alternate approach would be to use different transfer-size proportions for short and long latencies; in this case, however, we would not be doing a fair comparison across configurations.

Figure 3.17: Bandwidth utilization decreases with increase in latency and/or link capacity — i.e., bandwidth-delay product.

We note that if loss and latency configurations remain the same and link capacities increase, the throughput achieved has a slight but noticeable decrease. This observation may be explained as follows. Assume a link with given bandwidth capacity and uniform loss. Here the window size will be able to grow to a certain percentage of the maximum size before loss is experienced and subsequently the window size drops. If behavior remains the same, the percentage of maximum window size achieved is less for a high-capacity link. This is observed in the empirical results.

Throughput improves by introducing layer-4 relay(s) Figure 3.18(a) shows that introducing a single relay results in significant improvement in bandwidth utilization due to better average throughput. By introducing a relay halfway, we split the connection into two TCP connections. Each split connection has shorter latency and therefore results in better aggregate throughput and thus better bandwidth utilization. Figure 3.18(b) provides a summary of the relative difference in bandwidth utilization. Cascaded TCP with a single relay delivers up to twofold improvement in bandwidth utilization.

Figure 3.18(a) shows that as latency increases, Cascaded TCP achieves increasing bandwidth utilization. The relative difference continues to grow and goes beyond 100% in some cases, as seen in Figure 3.18(b). For example, with latencies of 256 ms at 8 Mbps, Cascaded TCP continues to be twice as efficient as compared to long-haul TCP. The same is observed for bandwidths as high as 32 Mbps and latencies of 64 ms, which is in fact the same bandwidth-delay product as the former example. We see similar trends with varying losses too.
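The roughly twofold gain follows directly from Mathis' bound: a relay at the midpoint halves each segment's RTT, which doubles each segment's throughput ceiling. A sketch, assuming uniform loss on both segments (the values are illustrative):

```python
import math

def mathis_bw(mss, rtt, p):
    # Mathis' upper bound (3.4) with b = 1, so c = sqrt(3/2).
    return math.sqrt(1.5) * mss / (rtt * math.sqrt(p))

e2e = mathis_bw(1460, 0.256, 0.001)    # long-haul ceiling, 256 ms RTT
split = mathis_bw(1460, 0.128, 0.001)  # each half of the cascade, 128 ms RTT
ratio = split / e2e                     # each segment's ceiling doubles
```

The aggregate cascade rate is then gated by the slower segment (or the link capacity), which is why the observed improvement tops out near 100%.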

We observe smooth trends in our results except for select cases in Figure 3.18(b) of 32 Mbps capacity and 128 ms latency. This apparent anomaly is due to the observed value being just over the lower bin limit and hence an artifact of binning.

We note that throughput drops at high bandwidth-delay products and we see diminishing returns with one relay. We anticipate that if more relays are added, while assuming that latency is equally split between them, we would continue to see increasing utilizations.

(a) Bandwidth utilization. (b) Percentage increase of bandwidth utilization between Cascaded and long-haul TCP.

Figure 3.18: Bandwidth utilization with a single relay for varying link capacities (4 – 128 Mbps) and latency (8 – 512 ms) at a loss of 0.1%. The 95% confidence intervals are not shown here as all observations fall within ±5% of the mean.

Multiple relays continue to improve throughput Figure 3.19 shows that using two relays results in further improvement in throughput and therefore utilization. With two relays, Cascaded TCP achieves approximately 90% utilization when the end-to-end latency is 64 ms and loss is 0.1%. This is reasonably near the theoretical limit of about 94% when compared to a connection with maximum utilization — the limit of 94% is experienced due to protocol overheads. We see the same trends for all cases of losses. Here again, we see that median bandwidth utilizations for Cascaded TCP are always significantly better than that of long-haul TCP.


Figure 3.19: Results for link capacity of 32 Mbps, latencies of 64 ms and 128 ms and losses of 0%, 0.01%, 0.1% and 1% are presented to compare long-haul and cascaded TCP with one and two relays. Zero relays imply long-haul TCP.

Cascaded TCP performs well with high losses With Cascaded TCP, we achieve better throughput even when losses are high because the latency for the split TCP connections is shorter, allowing the congestion-control algorithm to react more quickly. This delivers better throughput and therefore better bandwidth utilization. With zero losses, Cascaded TCP enables throughput up to twice as much as long-haul TCP connections; this is observed even when latencies are as high as 256 ms. Bandwidth utilization results for varying losses are summarized in Figure 3.20.

Note that for low loss, the improvement in throughput for two relays over one relay is not of the same magnitude as it is for one relay over no relay (or long-haul TCP), as shown in Figure 3.19.

However, as loss increases, we see that the magnitude of improvement is similar for two relays as compared to one, and for one relay as compared to none. At high losses, the congestion-control protocol does not allow the window sizes to grow because of losses, and therefore we are able to see the benefits of having relays. The more relays we add, the more we alleviate the impact of losses. However, at low losses, throughputs increase up to link capacity and are gated, and thus the benefits of increasing the number of relays are less evident. In Figure 3.19, for 0.01% loss, we see that the average measured throughput for the 128 ms latency configuration is less for two relays as compared to one relay. This anomaly is because of temporary self-congestion induced by the sender. Apart from this anomaly, we did not observe this behavior for other configurations.

Figure 3.20: Cascaded TCP continues to perform well in spite of high losses.

Cascaded TCP measurements correlate with Mathis' model Figure 3.21 shows measured bandwidths with respect to Mathis' approximation of the upper limit on bandwidth. We see that our empirical results for long-haul TCP have a strong correlation with Mathis' model, until we hit the link capacity — where throughput is capped. In other words, empirical observations show that beyond certain percentages of available bandwidth, the link capacity starts becoming the bottleneck. As mentioned earlier, when throughputs are gated by link capacity, the measurements are reasonably near the theoretical limit of about 94%.

Cascaded TCP has a steep slope, indicating that Cascaded TCP results in better throughput than what Mathis' model predicts, based on the end-to-end latency. The steep slope is observed because Cascaded TCP effectively doubles the achieved bandwidth by allowing the congestion-control algorithm to react faster; this assumes that the link capacity supports higher bandwidths.


Figure 3.21: Measured bandwidths are analyzed with respect to Mathis' approximation of the upper limit on bandwidth when using TCP. The y-axis represents the measured bandwidth (Mbps), whereas the x-axis represents the bandwidth estimated by Mathis' approximation [105]. Each line represents estimates at given link capacities.

We observe that throughputs experienced by a long-haul TCP connection for a particular configuration are about half of that observed by Cascaded TCP (with one relay). This effect is most pronounced at larger latencies (e.g., 128 ms and 256 ms). We discuss the reasons for this behavior in Section 3.7.5.

Case Study using PlanetLab

We conducted a case study on real networks using the PlanetLab [122] testbed. We evaluated the use of a single relay on different network paths, some of which are listed in Table 3.5.

These paths included inter-continental links. Our findings across these network paths were similar.

Table 3.5: PlanetLab paths tested as part of a case study.

Client                          Relay                                 Server
planet3.cs.ucsb.edu             planetlab-1.cs.uic.edu                planetlab-1.ing.unimo.it
planetlab01.cs.washington.edu   server1.planetlab.iit-tech.net        planetlab-01.vt.nodes.planet-lab.org
planetlab1.iitkgp.ac.in         planetlab-1.imperial.ac.uk            planetlab1.sfc.wide.ad.jp
planetlab1.eurecom.fr           planetlab-01.vt.nodes.planet-lab.org  planet-lab1.cs.ucr.edu
planet-lab1.cs.ucr.edu          planetlab-01.vt.nodes.planet-lab.org  planetlab1.cs.ucl.ac.uk

Note that with PlanetLab, as we operate in the live network, the cross traffic inhibits optimal bandwidth utilization. Also, there may be unknown circumstances (e.g., asymmetric link capacities) and events that may or may not influence the behavior of the traffic flowing across.

Table 3.6: Findings from a select case study

Metric                          Value
Mean latency, client to server  448.07 ms ± 1.6 ms
Mean latency, client to relay   194.74 ms ± 0.58 ms
Mean latency, relay to server   295.06 ms ± 0.34 ms
Packet loss, all paths          0%
Long-haul bandwidth             µ = 5.85 Mbps, CI: (5.25, 6.39)
Pipelined bandwidth             µ = 6.44 Mbps, CI: (6.00, 6.98)
p-value, H0: µ_p = µ_lh         0.011*

Consider the path: from planetlab1.iitkgp.ac.in, via planetlab-1.imperial.ac.uk, to planetlab1.sfc.wide.ad.jp. The performance measurements are summarized in Table 3.6. We see that the difference in throughput achieved by Cascaded TCP as compared to long-haul TCP is statistically significant. Here the difference is limited to about 10%, which can be explained by the use of default window sizes for TCP connections from the hosts. We were unable to reconfigure the hosts to use larger window sizes due to limited administrative access.

These default window sizes gated the sender window sizes from growing to accommodate the larger bandwidth-delay product. This resulted in lesser gains as compared to experiments in an ideal environment.

3.7.5 Discussion

When does Cascaded TCP become viable?

As observed in Figures 3.18(a) and 3.18(b), pipelined Cascaded TCP is viable for almost all configurations except for small bandwidth-delay products. In Figure 3.18(a), we see that a single relay continues to be efficient for bandwidths as high as 64 Mbps (at 64 ms) and 128 Mbps (at 32 ms). Beyond that, to remain efficient (i.e., obtain greater than 80% utilization), a second relay is needed, as illustrated in Figure 3.19.

As discussed in Section 3.7.2 using (3.5), (3.6) and (3.7), we can evaluate condition (3.3), which allows us to determine if using Cascaded TCP would be feasible. We compute estimates for throughput using the Cascaded TCP model and compare them with measured throughput.

Results for 128 Mbps link capacity are presented in Figure 3.22.

We see that the analytical model provides acceptable approximations for achievable bandwidth.

The errors can be partly explained by the simplifying assumptions we use to approximate overheads. Note that Mathis' model is also an approximation upon which we base our model. If Cascaded TCP were based on a more precise model (e.g., [119]), it would yield better estimates. Nevertheless, the analytical model allows us to make an informed decision whether to use Cascaded TCP or not. As with Mathis' model, the predictions beyond link capacity are not valid, and for negligible losses (i.e., 0.001%) the TCP model overestimates throughput (e.g., throughput estimates for 32 ms latency and 0.001% loss in Figure 3.22).

Figure 3.22: Estimated and measured throughput results are presented for link capacity of 128 Mbps with losses of 0.001%, 0.01%, 0.1% and 1%. In the model we use loss of 0.001% to approximate 0% loss. Note that bandwidth-delay products are proportional to latencies, which are shown in this figure.

The relative differences computed from empirical results (also shown in Figure 3.18(b)) for the same configuration highlight that we achieve approximately 100 percent improvement, which is what the model predicts. Averages of select permutations are presented in Table 3.7.

In contrast to large bandwidth-delay product scenarios, we note that the model predicts minimal improvement in throughputs for low bandwidth-delay products.

In summary, we observe that Cascaded TCP is beneficial when the bandwidth-delay product is greater than 32 KB, which incidentally is less than the typical default limit for window sizes — FreeBSD and other operating systems typically have 64 KB as the default TCP window size.

Table 3.7: Predicted and measured throughput (link capacity of 8 Mbps)

Latency (ms)  Loss (%)  Long Haul (Mbps)     Cascaded (Mbps)
                        Mathis    Measured   Predicted  Measured
64            0.1       4.99      4.53       9.43       7.28
64            1.0       1.58      2.05       3.10       3.81
128           0.1       2.49      3.21       4.85       5.31
128           1.0       0.79      1.03       1.56       1.90

How many relays do we need?

We can modify Mathis’ throughput approximation to accommodate relays. If we assume homo- geneity, the throughput would be as shown in (3.8), suggesting that the maximum throughput achieved would be that of the bottleneck. We may represent the latency for the bottleneck link as RTT/(N + 1) and loss as p/(N + 1), where N is the number of relays and for the sake of simplicity, loss is assumed to remain the uniform across links. Therefore, we have:

\[ \text{throughput} = \frac{MSS}{\left(\frac{RTT}{N+1}\right)\sqrt{\frac{p}{N+1}}} = R^{3/2} K, \tag{3.8} \]

where R = N + 1 and K = MSS/(RTT√p). Subsequently,

\[ \frac{\partial B}{\partial R} = \frac{3}{2} R^{1/2} K. \tag{3.9} \]

Equation (3.9) is a monotonically increasing function. This implies that if we continue to increase the number of relays, we should expect bandwidth utilization to increase until the capacity limits are reached. In Figure 3.21, we see that if we introduce a relay, we effectively double the throughput — until the link capacity is reached. While this approximation may hold theoretically, it does not hold in practice. From a cursory case study of available locations to set up relays between Virginia and New Mexico, we concluded that we would be able to set up at most six to eight relays.
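The R^{3/2} scaling in (3.8) can be checked numerically. A sketch with illustrative values; the capacity cap reflects the practical gating by link capacity discussed above:

```python
import math

def cascaded_bound(mss, rtt, p, n_relays, capacity=None):
    """Throughput bound from (3.8): R^(3/2) * K with R = N + 1.

    `capacity` (bytes/s), if given, caps the bound at the link capacity.
    """
    R = n_relays + 1
    K = mss / (rtt * math.sqrt(p))  # Mathis-style constant, bytes/s
    bw = R ** 1.5 * K
    return min(bw, capacity) if capacity is not None else bw

# Bound grows as R^(3/2) until the (hypothetical) 16 Mbps capacity gates it.
bounds = [cascaded_bound(1460, 0.256, 0.001, n, capacity=16e6 / 8)
          for n in range(6)]
```

As the derivative (3.9) predicts, each additional relay helps, but once the capacity is reached further relays add nothing — consistent with the six-to-eight-relay practical limit noted above.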

Where should the relays be located?

For simplicity we assume uniform spacing for the relays — for example one relay is setup half way between the sender and the receiver, similarly two relays are located such that the latency for each layer-4 hop is about one third the end-to-end latency between the sender and the receiver. This may not be a practical assumption. Our model and empirical results show that maximum improvement in performance is experienced when the relays are equally spaced.

If they were placed closer to or farther from the source, the benefits would be reduced. In this case the bottleneck bandwidth would be dictated by the segment with the longer RTT, and the benefits would decrease in proportion to the ratio of the longer segment's RTT to the long-haul connection's RTT.

We also assume that the relays are located along the same path that the long-haul TCP traffic traverses. It may not always be possible to locate a relay along this path. Moreover, an alternate path may have different loss, latency, and capacity characteristics and therefore incur additional overheads.

Does Cascaded TCP maintain end-to-end semantics?

The end-to-end semantics of TCP are broken when using Cascaded TCP because the connection is split into a cascade of independent TCP connections, which are put together by the relays to form a logical connection. We note that end-to-end semantics are also broken by middleboxes such as firewalls and NATs. The concerns we face in maintaining end-to-end semantics with Cascaded TCP relays are no more than what we already experience with middleboxes. As long as the risks of using such relays/middleboxes are understood, the benefits of increased throughput outweigh the concerns. Nevertheless, this may not apply to all situations.

Does Cascaded TCP maintain TCP friendliness?

Congestion-control algorithms are expected to maintain high bandwidth utilization, RTT fairness, and end-to-end semantics. They are also expected to be TCP friendly.

As Cascaded TCP continues to use TCP New Reno, it maintains both TCP friendliness and RTT fairness while improving network throughput.

BIC, CUBIC, and Compound TCP maintain end-to-end semantics and improve upon bandwidth utilization (when compared with TCP New Reno); however, they do not maintain TCP friendliness and RTT fairness at large latencies [117].

Should the use of Cascaded TCP be hidden from the user and how can it be deployed?

Whether the use of Cascaded TCP relays is apparent to applications or not depends on how the transport layer implements the solution. What is important to note is that there is a need to make a decision based on the context in which communication occurs — for example, based on the latency, loss, and link capacity experienced by the connection and the availability of relays. This decision process may be transparent and automated by a daemon implementing the analytical model described in Section 3.7.2, subsequently aiding the network stack, or it may be controlled by the end user through policies, as suggested by Border et al. [110].

As explained in Section 3.7.3, we used an expedient method to establish a proof of concept. Ideally, to support the transport layer, a framework would be required to identify potential relays, set up transport connections between the relays, and manage communication.

3.7.6 Conclusions

We have shown that we can improve bandwidth utilization by reducing the impact of end-to-end latency on typical congestion-control protocols. We do so by introducing layer-4 forwarding relays, which allow us to split a TCP connection into two or more TCP connections that are cascaded together to form one logical end-to-end connection. As each segment's congestion-control algorithm operates independently and reacts to feedback from the receiver much faster than that of the long-haul TCP connection, Cascaded TCP enables greater overall throughput and thus better bandwidth utilization. We present an analytical model that allows us to make an informed decision as to when the use of Cascaded TCP would be viable. We present and evaluate the results of the analytical model and our empirical tests. We conclude that introducing relays results in a significant increase in throughput for many practical scenarios.

It is evident that with knowledge of the context in which communication happens — e.g., over wide-area networks with large latencies — effective mechanisms can be developed to set up complex communication constructs. Although for the Cascaded TCP case study we manually configured the relays, with the availability of SLIM's control channel this process may be easily automated. Upon detection of large latencies, applications may choose to use Cascaded TCP instead of legacy TCP for large-volume transfers, and this can be facilitated by implementing specialized verbs to be used over the control channel.

3.8 Summary of Session-Based Communication Model

In this chapter we have proposed an extensible session-layer intermediary, called SLIM, built upon explicit session, flow, and endpoint abstractions. The separation of session and transport concerns leads to clearer descriptions of communication patterns and enables advanced network functionality: mobility, communication between two or more participants, and dynamic reconfiguration. Even more important, SLIM control channels provide a mechanism for revitalizing network-stack innovation by providing a new options space. A prototype SLIM implementation has been created to explore various design alternatives.

There are several avenues that may be explored as part of future work. We may explore the benefits of more internal cross-layer communication within the stack to allow better coordination. We may further expand the set of control channel verbs with a view to creating mechanisms upon which applications can build policies and functionality. We may also explore which portions of SLIM should be in kernel space and which should be in user space, leading to a more production-quality, high-performance implementation. We may modify more applications to utilize SLIM, particularly its advanced functionality, and report an empirical performance evaluation of the implementation.

Chapter 4

Enabling Extensions to the Network Stack

Today, underneath virtually all networked applications, the Transmission Control Protocol (TCP) plays a significant role in delivering data across the network. TCP has been wildly successful, and its simplicity and ease of implementation have resulted in wide adoption, which has materially contributed to the growth of the Internet [3]. This is also evident from the usage of network resources at Virginia Tech, where more than 80% of the Internet traffic is TCP based, as shown in Figure 4.1.

Along with ubiquity come increased expectations. Where users were once content with simple file transfers between machines, now they expect the network to provide functionality that TCP was never designed to support. Nevertheless, the need for greater functionality in TCP continues to grow [9–12, 18–24, 26–37, 40–45, 53, 54, 56, 57, 65, 77, 123, 124]. As we too have shown in Chapter 3, extensions to the network stack can yield significant gains. Unfortunately, TCP alone neither supports such functionality nor does it appear that it can be modified to support the new functionality in a backwards-compatible way [9–11, 22, 23]. In the belief that radical changes are required to extend its functionality, some researchers advocate a clean-slate approach as the only path forward [13, 16, 55, 57, 125].


(a) Volume of traffic exiting Virginia Tech campus
(b) Volume of TCP traffic exiting Virginia Tech campus
(c) Volume of UDP traffic exiting Virginia Tech campus

Figure 4.1: A highlight of the ratio of TCP-based traffic to the aggregate volume of traffic exiting the Virginia Tech campus on 30th March 2016. Note that UDP consumes about 10% of the share as the Chrome browser uses the QUIC protocol [77] to communicate with Google servers using UDP — which is another testament to the need for extending the network stack.

To benefit from the legacy of TCP, as well as to lay the groundwork for future extensions to the network stack, it is important that all proposals for extensions be backwards compatible with the legacy network stack, in particular TCP. In this chapter, we discuss and demonstrate how future extensions to the network stack may be developed in a backwards-compatible manner, thus enabling incremental adoption. We show that by the insertion of a simple hook, TCP can be made significantly more extensible; furthermore, it is possible to do so in a backwards-compatible manner. To substantiate our claims, we demonstrate a case study of enabling virtual machine migration beyond a subnet while maintaining network connectivity. We also study the impact of middleboxes on our approach of adding extensions to TCP. In this case the extension was to enable fault-tolerant transport connections in the face of disconnections.

We build upon the work of Ford [9] and Iyengar [124] and introduce a limited isolation boundary between TCP and the application. The purpose of the isolation boundary is to decouple an application's data stream from the underlying TCP transport flow to allow protocol designers freedom to extend TCP to implement new functionalities. The lightweight mechanism utilizes a set of TCP options, referred to as the Isolation Boundary Options (IBOs), during the connection setup phase, and provides for the creation of a control channel that endpoints can use to negotiate additional functionality, as appropriate, during the lifetime of the connection. Since the presence or absence of the new options at connection setup time indicates whether or not a stack implements the extension mechanism, adoption can be incremental. Connection setup falls back to legacy behavior if either stack fails to recognize the new options.

4.1 Approach Towards TCP Extensions

The research most closely related to ours includes [9, 10, 123, 124]. Ford et al. [123] suggest using TCP options to compose an application stream from multiple transport flows. These transport flows may then be mapped onto different network paths (if available). Iyengar et al. [9, 10, 124] discuss a logical refactoring of the transport layer to form the semantic, isolation, flow, and endpoint sub-layers in an architecture they refer to as Transport Next Generation (Tng), as shown in Figure 4.2. The refactoring highlights the fact that over time a variety of roles have been coalesced into the transport layer. These roles may be broadly classified as follows: the identification of transport endpoints; performance-related functions such as congestion control; mapping of transport endpoints to entities; and end-to-end semantic functions such as data ordering.

While the authors approach the challenge of network extensions as purely a transport-layer problem, we approach the problem from a holistic perspective. We suggest that there are at least two facets to the challenge: 1) to enable modern communications, we need to develop a model that can describe contemporary use cases with the help of suitable abstractions, and 2) we need to build such extensions on top of existing network stacks in a manner that is backwards compatible, thus allowing incremental adoption. We discuss the session-based communication model in Chapter 3. In this chapter we tackle the second facet, that is, how to implement the extensions in a backwards-compatible manner.

Our proposed modifications are between the transport and application layers in the TCP stack, corresponding to the boundary between the flow and isolation sub-layers in the Tng architecture. The isolation boundary decouples the application data stream from the TCP flow in a backward-compatible manner. This decoupling, along with the setup of a control channel, paves the way for substantial extensions to TCP. While this work does not address any of the aforementioned individual problems directly, it creates a framework upon which solutions can be composed.

Note that the session layer implementation, presented in Chapter 3, fits between the application and transport layers, while using the isolation-boundary extensions. Other examples of extensions limited to the transport layer may be composing an application stream from multiple transport flows or setting up a hybrid transport along the Internet path. We maintain that the isolation boundary defined here is a suitable mechanism upon which to build the isolation sub-layer in Tng.

[Figure: the TCP/IP stack (Application, Transport, Internet, Link, Physical) shown alongside the Tng layers (Semantic, Isolation, Flow, Endpoint), with the Isolation Boundary marked between them.]

Figure 4.2: The Isolation Boundary in the context of the TCP/IP stack and the Tng layers.

4.2 Proposed Solution

We propose to extend TCP via a set of TCP options, called Isolation Boundary Options (IBOs), to provide a flexible and dynamic mechanism for creating a larger class of extensions.

The IBOs are a “hook” for introducing future extensions. Specifically, the IBOs serve two purposes: (1) they decouple the application data stream from the TCP flow that provides transport by creating a logical transport-independent flow that is mapped onto the transport-dependent (TCP) flow, and (2) they establish a control channel for composing mappings between application data streams and the transport-independent flows in a much more flexible and dynamic way than provided by TCP options. Ways in which the mechanism can be used to implement additional functionality will be discussed later in the chapter, but first we consider the semantics of the isolation boundary.

4.2.1 Concept and Semantics

We define an IBO to contain two pieces of information: (1) an ID to identify a transport-independent logical flow, and (2) a sequence number from an appropriate sequence space. The ID, denoted the Transport-Independent Flow ID (TIFID), is unique in the context of the participating stacks.1 As with TCP, sequence numbers orient a protocol data unit in the application data stream. They are also used to acknowledge data that has been received. Sequence numbers used for the former purpose are called Transport-Independent Sequence Numbers (TISeq) and those used for the latter are called Transport-Independent Acknowledgment Numbers (TIAck).

Maintaining Backwards Compatibility

TCP stacks advertise that they implement the isolation boundary by specifying an IBO during connection setup. If both hosts specify an IBO, then the isolation boundary functionality is enabled. Otherwise, both fall back to legacy TCP. In this way, backward compatibility is maintained, and there is no requirement that all hosts be updated for the extensions to work.

Transport-Independent Flow Setup

Consider the sequence diagram shown in Figure 4.3 for PeerA and PeerB. A TIFID unique to both stacks is needed to identify the logical flow. One approach is to allow each stack to select one half of the TIFID. In this case, PeerA defines the first half of the TIFID using a random value. It also initializes the TISeq number, using a random value, to define its transport-independent sequence space and establishes a mapping between the TISeq and the TCP sequence number. These partial TIFID and initial TISeq are sent to PeerB in the SYN packet containing an isolation boundary option.

1 TIFIDs are not session IDs per se. A session would consist of a composition of transport-independent logical flows and would have its own logical identity. As shown in Chapter 3, sessions are an example of additional functionality that can be implemented with the proposed mechanism.

[Figure: PeerA chooses TIFID_A and TISeq_A and sends SYN + TIFID_A + TISeq_A; PeerB records them, chooses TIFID_B and TISeq_B, and replies with SYN + ACK + TIAck_A + TIFID_B + TISeq_B; PeerA records TIFID_B and TISeq_B and replies with ACK + TIAck_B. TIAck fields are zero when not in use.]

Figure 4.3: Sequence diagram of the exchange of Isolation Boundary Options during connection setup.

Upon receipt of the SYN packet, PeerB defines the second half of the TIFID using a random value and also defines its TISeq number using a random value to establish its transport-independent sequence space. Finally, it sends the completed TIFID and its TISeq back in the SYN+ACK TCP header, making sure to acknowledge the TISeq it received from PeerA using the TIAck field.

Upon receipt of the reply, PeerA notes the completed TIFID, which uniquely identifies the flow. It returns an ACK packet as the final phase of the three-way handshake, making sure that it acknowledges the SYN it received using a TIAck.2 At this point, transport-independent flows in each direction have been established, along with the associated bidirectional TCP connections. The transport-independent flows constitute a limited control channel through which the two stacks are able to negotiate and coordinate additional functionality in TCP. An illustration of the logical flow to transport connection mapping is shown in Figure 4.4.

2 Note that unlike conventional TCP options, the IBOs are acknowledged in the above exchange as part of the three-way handshake. If the options in either direction were removed, by a middlebox perhaps, then both stacks would become aware of it and fall back to legacy behavior.

[Figure: a transport-independent (TI) flow, exposed through the sockets API, is mapped across the isolation boundary onto Transport Connection A; after a disruption (e.g., VM migration), the TI flow is re-synchronized with TCP using the TIFID and mapped onto Transport Connection B.]

Figure 4.4: An illustration of the transport-independent flow mapping to TCP connections.
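The half-and-half TIFID construction described above can be sketched in a few lines of Python. This is a toy sketch: the 24-bit halves and 48-bit sequence numbers mirror the strawman field sizes discussed later in the chapter, and all variable names are illustrative.

```python
import secrets

HALF_BITS = 24  # each peer contributes one half of the 48-bit TIFID (sizes assumed)

def choose_half():
    """Pick a locally unique (here: random) half of the TIFID."""
    return secrets.randbits(HALF_BITS)

def make_tifid(half_a, half_b):
    """Combine the two locally chosen halves into one 48-bit TIFID."""
    return (half_a << HALF_BITS) | half_b

# PeerA: the SYN carries its half and an initial TISeq; the second half is still zero
half_a, tiseq_a = choose_half(), secrets.randbits(48)
partial_tifid = make_tifid(half_a, 0)

# PeerB: completes the TIFID, picks its own TISeq, echoes TIAck = tiseq_a
half_b, tiseq_b = choose_half(), secrets.randbits(48)
tifid = make_tifid(half_a, half_b)
tiack_a = tiseq_a  # acknowledges PeerA's initial sequence number

# PeerA: records the completed TIFID, which both stacks can now reconstruct
assert tifid >> HALF_BITS == half_a
assert tifid & ((1 << HALF_BITS) - 1) == half_b
```

Because each side chooses only its own half, no coordination beyond the handshake itself is needed for the TIFID to be unique in the context of the two stacks.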

Transport-Independent Flow Close

Next, we consider how to close a transport-independent flow and release resources. The primary mechanism for a graceful close is a message exchange on the control channel. Alternatively, a graceful close of the underlying TCP connection, i.e., in response to a FIN packet, causes the state associated with the transport-independent flow to be cleaned up. Ultimately, as with any protocol operating over an unreliable communications channel, the use of protocol timeouts is unavoidable. A timeout triggered within TCP would propagate up to the transport-independent flow and either cause it to close or cause it to attempt a reconnection.

Re-Synchronization

At any time during the life cycle of the transport-independent flow, the connection may be re-synchronized by exchanging IBOs in a new TCP three-way handshake. Typically, IBOs are exchanged again when flows resume operation after a TCP disconnection or when communication becomes impossible due to an address change. After re-synchronization, both stacks may safely discard state associated with the old TCP connection. Because the data structures were created when the isolation boundary was set up during the original three-way handshake, the transport-independent flows can be synchronized to a new TCP connection. The procedure for resuming operation after disconnection is the same as for creating a new connection except that the previously completed TIFID is used instead. Since the TIFID is already complete, the receiving stack looks up the isolation boundary information corresponding to the complete TIFID and resumes the flow rather than creating a new one. The exchange of SYN and SYN+ACK packets in this case allows the stacks to re-synchronize by exchanging the TISeq numbers where they left off at the time of the disconnection. Re-synchronization attempts are also validated by TISeq numbers that logically fit within the current state of the flow, similar to how TCP validates sequence numbers.
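The resume-versus-create decision at the receiving stack can be illustrated with a toy in-memory flow table. This is a sketch only: the fixed window used to validate the offered TISeq is a stand-in for TCP-style sequence validation, and the TIFID values are arbitrary.

```python
flows = {}  # TIFID -> flow state (a toy in-memory flow table)

def on_syn(tifid, tiseq, window=1 << 20):
    """Resume an existing transport-independent flow or create a new one.

    A SYN carrying a TIFID already in the table is a re-synchronization
    attempt; it is accepted only if the offered TISeq plausibly continues
    the flow (here: within a fixed window of the last recorded TISeq).
    """
    flow = flows.get(tifid)
    if flow is None:
        flows[tifid] = {"tiseq": tiseq, "resumed": False}
        return "new"
    if abs(tiseq - flow["tiseq"]) > window:
        return "rejected"               # TISeq does not fit the flow's current state
    flow.update(tiseq=tiseq, resumed=True)
    return "resumed"

assert on_syn(0xA1B2C3D4E5F6, 1000) == "new"
assert on_syn(0xA1B2C3D4E5F6, 1500) == "resumed"    # after a disconnection
assert on_syn(0xA1B2C3D4E5F6, 10**9) == "rejected"  # implausible TISeq
```

The lookup-by-TIFID step is what distinguishes a resume from a fresh connection: the transport endpoints may have changed entirely, but the flow identity has not.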

Use of the Control Channel

At the conclusion of a successful setup phase, a control channel exists between the two stacks. At this point, data channels to serve the application's data stream will be set up subordinate to the control channel. These are set up in a similar fashion to the control channel and are separate transport-independent flows. Requests and responses that implement additional functionality on top of TCP are communicated across the control channel. The following are some of the possibilities. In each of the cases, the control channel provides a mechanism for composing or manipulating transports of various kinds.

Resuming After a Disconnection. The ability to resume after a disconnection is a direct consequence of implementing the Isolation Boundary and is discussed earlier in the section.

Multihoming. Such support is only possible if we decouple flow identification from transport endpoint identification. Since we can do so given the Isolation Boundary, we can construct a virtual flow which may be mapped to transport connections over different networks. This requirement was also identified in Multipath TCP [123]. Sophisticated possibilities of striping a virtual flow onto multiple transport connections (operating over different network paths) may also be realized. In these cases, the question of how to map the transport-independent sequence space onto multiple TCP sequence spaces must also be carefully considered.

Migration. Given the extensions of resuming after a disconnection and support for multihoming, we can envision the possibility of migrating the mapping of a virtual flow from one transport connection to another.

Hybrid Transports. With a control channel in place, the communicating peers can construct a hybrid transport. The control channel allows the peers to share information regarding appropriate application gateways. While the peers converse over a (typical) packet-switched transport connection, the respective gateways may be requested (by the peers) to set up a circuit on their behalf. Once the circuit is in place, the peers may set up transport connections to the application gateways and later migrate the virtual transport from the packet-switched transport to the hybrid transport connection.

It is envisioned that the protocol will be extensible in order to remain flexible in the face of future requirements. Clearly, setup, tear-down, and reconfiguration messages are required. The means for determining the capability of the remote peer are also necessary.

Lightweight Isolation Boundary Operation

Up to this point, the isolation boundary option defined for TCP provides increased functionality by creating a control channel that is used by the stacks to negotiate and implement new functionality. Not all transport connections need the full extensibility (and heavier weight) of a control channel. We therefore define a variant of the IBO which still provides transport independence for data but does not create an out-of-band control channel. To distinguish between the two variants, we call the first Isolation Boundary Option – Control (IBO-C), the variant discussed so far, and the second Isolation Boundary Option – Data (IBO-D). Both types are established in the same way as discussed earlier in the section. IBO-C allows protocol designers great flexibility in adapting TCP's behavior. Normally IBO-D is used for a subordinate data channel, but in lightweight operation it has no associated control channel. Two stacks negotiate the use of lightweight operation if either stack advertises an IBO-D during the initial flow setup. This admits a simpler transitional implementation.

The IBO-D establishes an opaque flow onto which an application data stream is mapped. Because the flow is opaque, the only capability added to TCP by the option is the separation of the transport-independent flow from the underlying transport connection. However, this is sufficient for IBO-D to support resuming after disconnection and migration. These capabilities are useful to applications even if there is no need for any other functionality.

Mapping Sequence Spaces

A one-to-one mapping of the transport-independent to the TCP sequence space is straightforward. Connection setup establishes the initial mapping. During a transfer, the sequence numbers advance in synchrony as data is successfully acknowledged by the transport layer. Because of the implicit synchronization, there is no need to explicitly send the TISeq and TIAck numbers after the three-way handshake.

The synchronization between the TISeq and corresponding TCP sequence numbers is lost if the transport connection is lost. During reconnection, the correspondence between the TISeqs and the new TCP sequence numbers is re-established, thereby resuming reliable communications at the same point where the flows left off in the application data streams.
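A one-to-one mapping amounts to tracking a single offset between the two sequence spaces, re-established at reconnection time. The sketch below illustrates this under simplifying assumptions (sequence-number wrap-around is ignored, and the class name is illustrative):

```python
class SeqMapping:
    """One-to-one mapping between the transport-independent sequence
    space (TISeq) and the TCP sequence space of the current connection."""

    def __init__(self, tiseq_init, tcp_seq_init):
        # The entire mapping is captured by one constant offset.
        self.offset = tiseq_init - tcp_seq_init

    def to_tiseq(self, tcp_seq):
        return tcp_seq + self.offset

    def to_tcp(self, tiseq):
        return tiseq - self.offset

m = SeqMapping(tiseq_init=5000, tcp_seq_init=100)
assert m.to_tiseq(150) == 5050            # 50 bytes acknowledged so far

# Connection lost at TISeq 5050; reconnect with a fresh TCP initial sequence number
m2 = SeqMapping(tiseq_init=5050, tcp_seq_init=9000)
assert m2.to_tcp(5050) == 9000            # the flow resumes where it left off
```

The second mapping shows why re-synchronization works: the TISeq carried in the new handshake pins the application data stream to the new TCP sequence space, regardless of the new connection's initial sequence number.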

We note that other mappings between the transport-independent and TCP's sequence numbers are possible depending upon the functionalities being implemented by the stack over the control channel. For example, multiple data connections may be set up and data striped across the connections to take advantage of multiple paths for resilience or throughput purposes.

4.2.2 The Wire Protocol

To assess the feasibility of a backward-compatible TCP isolation boundary, we now imagine one possible implementation of the TCP isolation boundary option.

We define an option with explicit fields for the TIFID, TISeq, and TIAck. The full complement of fields is likely not needed in each phase of connection establishment. As such, the bit-field definitions below can be considered a worst case, consuming most of the remaining TCP option space during connection setup. A more frugal mapping of concept to bits is certainly possible.

A TCP header, as shown in Figure 4.5, may consist of up to 40 octets of options. Over the years, a number of options have been defined. Hence, the space available for new options has become constrained. Note that even though almost the entire remaining options space during establishment is consumed for the IBO, the role of TCP options to allow for extensibility can now be assumed by the control channel in a far more flexible manner. We discuss this further in Section 4.5.

At connection setup time, there are already four TCP options in common use: window scaling, time stamps, maximum segment size, and selective acknowledgments permitted. Factoring in the 19 octets these options require, 21 octets out of 40 are still available during connection establishment. A simple approach uses 20 of the remaining octets to implement the isolation boundary.

[Figure: the standard TCP header fields (source/destination ports, sequence and acknowledgement numbers, data offset, flags, window, checksum, urgent pointer) followed by the proposed option: a one-octet option tag, a one-octet length, and the TIFID, TISeq, and TIAck fields, each carried as six octets.]

Figure 4.5: The proposed transport-independent flow option.

The first field of 48 bits contains the Transport-Independent Flow Identifier (TIFID), which labels the flow independent of the underlying transport. As the TIFID only needs to be unique within the context of the two end hosts, the requesting process specifies a locally unique value for the first half of the TIFID and the responding process later specifies the second half of the TIFID. Thus the TIFID is guaranteed to be unique to both stacks. During the time that the TIFID is partially specified, the second half is set to zero.
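The strawman 20-octet encoding can be sketched with Python's struct module. The option kind value used here is the experimental kind 253 from RFC 4727, chosen only as a placeholder; the field widths follow the layout above.

```python
import struct

OPTION_KIND = 253   # experimental TCP option kind (RFC 4727); a placeholder value
OPTION_LEN = 20     # kind + length octets, then three 48-bit fields

def pack_ibo(tifid, tiseq, tiack):
    """Pack the strawman 20-octet isolation boundary option.

    Each 48-bit field is emitted as six big-endian octets.
    """
    def u48(v):
        return v.to_bytes(6, "big")
    return struct.pack("!BB", OPTION_KIND, OPTION_LEN) + u48(tifid) + u48(tiseq) + u48(tiack)

def unpack_ibo(data):
    """Parse the option back into its (TIFID, TISeq, TIAck) fields."""
    kind, length = struct.unpack("!BB", data[:2])
    assert kind == OPTION_KIND and length == OPTION_LEN
    field = lambda i: int.from_bytes(data[2 + 6 * i: 8 + 6 * i], "big")
    return field(0), field(1), field(2)

opt = pack_ibo(0xA1B2C3D4E5F6, 123456, 0)   # TIAck is zero when not in use
assert len(opt) == OPTION_LEN
assert unpack_ibo(opt) == (0xA1B2C3D4E5F6, 123456, 0)
```

Two octets of kind/length plus three 6-octet fields account for the full 20-octet budget discussed in Section 4.3.1.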

The isolation boundary between the upper protocol layers and the transport is further strengthened by two 48-bit protocol-independent sequence spaces, one for each flow direction.3 As with TCP, the two endpoints select initial Transport-Independent Sequence Numbers (TISeq) during the three-way handshake. Transport-Independent Acknowledgement Numbers (TIAck) are returned to acknowledge the receipt of the SYN packets. The TISeq are mapped onto the protocol-dependent sequence numbers of the underlying (TCP) transport and remain synchronized with them as long as the transport connection is active. When a transport sequence number is incremented, so is the TISeq.

3 The TIFID and the sequence number fields were chosen to be as large as possible as a compromise between providing better support for large congestion windows and the number of option bits available.

4.3 Discussion

Any change to TCP will appropriately be met with a critical eye. Protocols and practices have evolved from the widespread use and development of the Internet, creating dependencies on TCP. We have argued that TCP should admit a smooth transition to new functionality in order to minimize the cost of transition for network operators and users. In this sense we now turn our attention to the interaction with existing protocols and practices. For the sake of this analysis, we assume the strawman wire protocol previously described.

4.3.1 TCP Option Space

In considering the plausibility of an IBO for TCP, it is critical to assess the space available in the TCP header. We focus on options that are considered mandatory or in common use in implementations. As a distinction, the validity of many options depends on the state of the connection. During the three-way handshake the following options need to be supported: Maximum Segment Size (RFC793, four octets) [1], Window Scaling (RFC1323, three octets) [126], Selective Acknowledgment Permitted (RFC2018, two octets) [127], and Time Stamp (RFC1323, ten octets) [126]. Based on our analysis, there is sufficient room for the 20 octets that the two isolation boundary options require.
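The octet arithmetic works out as follows (a small check; the per-option sizes are the ones quoted above from the respective RFCs):

```python
# Octets consumed by the SYN-time options in common use
syn_options = {
    "Maximum Segment Size": 4,   # RFC793
    "Window Scaling": 3,         # RFC1323
    "SACK Permitted": 2,         # RFC2018
    "Timestamps": 10,            # RFC1323
}
TCP_OPTION_SPACE = 40            # maximum octets of options in a TCP header

used = sum(syn_options.values())         # 19 octets
remaining = TCP_OPTION_SPACE - used      # 21 octets

assert used == 19 and remaining == 21
assert remaining >= 20                   # room for the 20-octet IBO
```

The single remaining octet leaves essentially no slack, which is why Section 4.3.2 considers which other options would have to be omitted.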

4.3.2 Incompatible Options

Due to the limited TCP option space, not all options can be supported simultaneously. Here we address several other options that are valid during the three-way handshake. We will disregard the Alternate Checksum Option (RFC1146) [128] and the Partial Ordering Option (RFC1693) [129] since, according to the TCP Roadmap (RFC4614) [130], there is a lack of interest in these protocols. The TCP Roadmap also notes that T/TCP (RFC1644) [131] has a serious defect. TCP MD5 (RFC2385) [132] and the follow-on TCP Authentication Option (RFC5925) [133] are used to protect BGP and LDP and hence are likely not to benefit from an isolation layer. Since these are concerned with protecting the infrastructure itself and are not used for user traffic, we need not concern ourselves with compatibility. The last protocol we consider is the Quick-Start Response (RFC4782) [134]. This protocol is experimental, and it remains to be seen if there will be widespread adoption. If the IBO and Quick-Start were both to come into common usage, then the contention would need to be resolved by omitting some option from the SYN and SYN/ACK packets.

4.3.3 Performance

The lack of field alignment, regardless of which option causes it, may lead to degraded performance for some network stacks due to the misaligned memory accesses that may require individual octet manipulations. In the case of the IBO, a peer stack may see degraded performance whether or not it supports isolation. Since the IBOs are only valid during the three-way handshake, their processing is off the critical data path and thus should not adversely affect performance.

4.3.4 Simplicity

With regard to simplicity, the isolation boundary directly implements only (1) the decoupling of a transport-independent flow from the transport connection, and hence from the network endpoint identifiers involved, (2) a means for keeping track of where the conversation is in the flow, and (3) the establishment of a control channel upon which additional functionality can be built. The immediate implication is that multiple TCP connections can be associated with one flow over its lifetime, with multiplexing opportunities in both time and space. Clearly, the first two make re-synchronization possible.

4.3.5 SYN Cookies

SYN cookies [135] mitigate a serious vulnerability in TCP. A server must maintain state in its SYN cache for each connection attempt received. An attacker can easily exploit this by crafting TCP SYN packets to fill up the SYN cache and overwhelm the server. Normally, the SYN cache records the state that is required to establish a TCP connection; the IBO would also have to be recorded there. When a server is under attack, it instead responds with a SYN cookie that maintains minimal state and allows the server to continue to serve new connection requests, albeit in a degraded mode of operation. Like most options, the IBO will not be preserved when a server is operating under attack. Because the isolation boundary is backward compatible, a server in this mode will continue to operate in a classic TCP fashion.

4.3.6 Middleboxes

A TCP connection that makes use of the IBO behaves identically to a TCP connection that does not. The primary concern with respect to the IBO is what happens when a legacy system sees an IBO option. If one of these systems encounters an as-yet-unknown IBO, it may do one of two things: it may either remove the offending option, or it may discard the offending packet. From the perspective of not degrading responsiveness as perceived by the user, the first behavior is preferred, as it will simply cause TCP to fall back into a classic mode of operation. The second, however, results in a needlessly dropped packet that forces the user to wait for a retransmission in order to fall back to a legacy mode of operation. This would be perceived by the user as unresponsiveness, but it could be mitigated by the client sending two SYN packets, one with and one without the IBO. The client would then prefer the connection that supported isolation and simply reset the other. We continue this discussion later in Section 4.5.

4.3.7 Security

As a design goal, the isolation boundary should have security characteristics that are no worse than those of TCP. The primary vulnerability introduced by the IBO is that it allows re-synchronization of a connection from any network address. If an attacker knows critical information about the current connection state, it is possible to hijack an existing connection from anywhere else on the Internet; however, the attacker must know the TIFID and the TISeq numbers of the current set of unacknowledged data.

In order for an attack to be successful, the attacker must have knowledge of the full conversation from the point of instantiation. The TIFID is only exposed in the three-way handshake and cannot be easily guessed due to its length. The TISeq numbers need to be derived from the initial sequence numbers, the current TCP sequence numbers, and the fact that TCP sequence-number roll-over may have occurred. In other words, the attacker needs to know the TIFID and the count of all the octets that have passed in either direction in order to falsify the reconnect request.

If the attacker is not on the data path between the parties in the connection, then the attacker must have a collaborator that is. This is no worse than TCP connection hijacking without the IBO, in that the attacker in this case also needs to be on the data path.

4.3.8 Application Compatibility

Not only does an extended TCP that utilizes an isolation boundary need to be backward-compatible with other peer implementations, but it also needs to be backward-compatible with applications. Since in all other respects the semantics of TCP have not changed, an application that is unaware of other functionality that might be enabled by the isolation boundary will continue to operate correctly when using a TCP with the IBO. In fact, such an application will gain some benefit in being able to re-synchronize a lost connection.

4.4 Case Study Exemplifying TCP Extensions: Virtual Machine Migration Beyond Subnets

To highlight the benefits of extending TCP to enable future innovation, we discuss a case study below. We explain how the legacy network stack and the current technologies that support live migration of virtual machines (VMs) require that the VM retain its IP network address. As a consequence, VM migration is oftentimes restricted to movement within an IP subnet or entails interrupted network connectivity while the VM migrates. Thus, migrating VMs beyond subnets becomes a significant challenge for the purposes of load balancing, moving computation close to data sources, or connectivity recovery during natural disasters. Conventional solutions use tunneling, routing, and layer-2 expansion methods to extend the network to geographically disparate locations, thereby transforming the problem of migration between subnets into migration within a subnet. These approaches, however, increase complexity, involve considerable human involvement, and most importantly, do not address the source of the problem — i.e., the limiting assumptions of TCP implementations whereby the IP addresses are used as part of the 4-tuple to identify a transport connection.

We make the case that decoupling IP addresses from the notion of transport endpoints is the key to solving a host of problems, including seamless VM migration and mobility. We demonstrate that VMs can be migrated seamlessly between different subnets — without losing network state — by presenting a backward-compatible prototype implementation and a case study.

4.4.1 Background

Virtual machine (VM) migration has served as a technology to enhance resource allocation and utilization. In turn, it has ushered in the cloud computing era, whether for load-balancing purposes to eliminate hotspots [136], moving computation close to data sources [137], or failover planning. Seamless VM migration across networks opens doors to additional opportunities.

Consider, for example, the possibility of a live migration of financial services hosted in a data center on the east coast of the United States. In an impending disaster, these services may be migrated to a data center on the west coast without interrupting existing network connections, which may have originated from within or outside the data center. While the concerns of live VM migration within a data center have been successfully resolved to a large extent, the issues of live VM migration beyond a subnet are yet to be addressed.

The seamless migration of VMs involves a host of challenges such as transferring VM images [136, 138], managing storage [137], copying intermediate state [139], maintaining network connections [138, 140], addressing security considerations [141], meeting performance goals [136], facilitating operations management, and so forth. Here, we focus on the issue of maintaining network connection state following a live VM migration beyond a subnet.

Contemporary hypervisors rely on the Reverse Address Resolution Protocol (RARP) to maintain network connection state following a migration within a subnet. However, migration beyond a subnet is a challenge, as the IP addresses of the network interface normally change. This change results in a disruption in continuity, as existing connections time out because the endpoint with the old IP address does not exist anymore. Unless applications implement reconnection, migration results in disconnections. This precludes the straightforward migration of VMs to geographically disparate locations (e.g., between data centers in different regions).

When using the isolation boundary, the decoupling of transport flow labels from the network addresses ensures that a change of IP address does not impact the connection state. This is because the connection is identified by a label independent of the IP address (rather than by the traditional 4-tuple of IP addresses and ports). Such independence not only enables VM migration, but it also facilitates features such as mobility and reliable connectivity [72].

Conventional solutions, such as tunneling [138], modified routing [140], and layer-2 expansion [142–144], work around the problem of connection loss due to a change of IP address. Specifically, in these cases, the layer-2 network is extended to geographically disparate locations. This approach transforms the problem of migration of VMs between subnets into VM migration within a subnet, which is already well understood.

However, extending a subnet to geographically disparate data centers is complex and requires considerable human involvement [138, 140]. Methods of expanding the layer-2 network [142–144] may not perform well at large scale (e.g., the scale of the Amazon cloud), where the different subnets are to form parts of a single layer-2 domain. With tunneling [138], routing [140], and some layer-2 expansion methods (e.g., [144]), a coupling exists between the source subnet and the migrated VM, which is an undesirable constraint.

Moreover, network architects typically partition networks within a data center into multiple subnets, particularly for large-scale data centers and clouds (e.g., Google or Amazon). Whether for flexibility of design, partitioning services, managing reliability or failover, balancing load, managing operations (e.g., maintenance), or maintaining security, partitioning allows the architect more degrees of freedom. Therefore, by enabling layer-3 migration (vs. layer-2 migration), we can avoid the migration constraints that cloud services may have due to coupling with the hardware (or layer-2 domain), particularly when data centers are partitioned into subnets.

Thus, rather than continuing to address the symptoms of the problem as above, we focus on the source of the problem — the use of IP addresses in the 4-tuple to identify a transport connection — in order to enable seamless layer-3 migration of VMs across networks.

4.4.2 Existing Approaches of VM Migration

Following a migration within a subnet, hypervisors currently rely on the Reverse Address Resolution Protocol (RARP) [145] to enable continued use of existing TCP connections. After migration, the VM continues to use the IP addresses that it was configured with in the source subnet. On the other hand, the physical MAC addresses of the new host will be different. RARP renews the mapping between the VM's IP addresses and the host's MAC addresses. As a result, the migration process appears seamless. Here we assume that the downtime during the migration is short enough that the network connections do not time out.

Apart from the issues of transferring VM images and managing its state and storage, the challenge of migrating a VM beyond a subnet (in contrast to within a subnet) is maintaining appropriate network (TCP) connection state. The methods used to address the challenge of live migration while maintaining network state are different manifestations of tunneling [138], routing [140], and layer-2 expansion [142–144].

Tunneling

In their tunneling-based approach, Bradford et al. [138] use IP tunneling [146] to redirect network traffic from the source to the destination subnet and therefore avoid network disruption. Upon migration, an IP tunnel is set up, and traffic is tunneled to the destination subnet after migration is complete. As soon as the VM is initialized at the destination subnet, it acquires a new IP address for its interface and can respond to incoming traffic meant for the new address as well as the old address — the interface is set up to respond to both addresses. Dynamic DNS is used to update the IP address of the services hosted in the VM. The tunnel is terminated after all connections using the old address are shut down. This approach, however, requires cooperation from the source server. In addition, until the old connections terminate, a coupling exists between the migrated VM and the source subnet.

Software Defined Networking

Erickson et al. [140] demonstrate that OpenFlow allows applications/VMs to continue to use their old IP addresses even when they are migrated to different subnets. The migration tools and hypervisors deal with the issues of transferring the VM, while OpenFlow is used to direct traffic to the destination subnet. If the setup is automated, it may not take long to configure the forwarding tables. However, here too, the source of the problem — the coupling between transport connection labels and IP addresses — is not addressed; instead, the solution works around it.

Layer-2 Expansion

The VXLAN [142] framework creates an overlay network over layer-3 by encapsulating the entire layer-2 frame. VXLAN Tunnel End Points (VTEPs) may be used to expand a layer-2 network to geographically disparate locations. For example, two geographically disparate data centers could be made into one layer-2 network using a VTEP at each data center. The VTEPs would be responsible for encapsulating traffic with a VXLAN header, forwarding it, and subsequently decapsulating the layer-2 frame when transferring it to the destination host. Note that, as with OpenFlow, the VXLAN solution requires additional complexity in the form of VTEPs.

Another manifestation of the approach of having a single layer-2 network is to have a large layer-2 domain. As the Spanning Tree Protocol (STP) has stability issues when the layer-2 domain grows too large, protocols such as TRILL [143] enable large layer-2 domains by replacing STP. In essence, TRILL applies the Intermediate System to Intermediate System (IS-IS) protocol to route Ethernet frames. Though vendors support such solutions, the debate is still open as to whether TRILL would benefit data center implementations or result in poor data center designs [147].

Similarly, a virtual private LAN service (VPLS) [144] over IP/MPLS has been used as a method to expand the layer-2 domain. Typically, LAN segments are brought together by virtualizing a switch across a link — making multiple switches appear as a single virtual switch extending over geographical distances. Here too, the layer-2 domain is allowed to grow to large sizes by managing multiple but localized spanning trees.

While the above approaches are sound, they are fundamentally ad-hoc solutions because they treat the symptoms of the problem rather than the problem itself. In contrast, Salz et al. [12] and Snoeren et al. [20] have suggested decoupling flow labels from IP addresses. However, their approaches have not been adopted because they are not backward compatible and require a “flag day” for deployment.

Furthermore, the primary purpose of the above approaches is not to support VM migration; their contribution lies elsewhere. For example, with VXLANs, the primary motivation is to expand the VLAN address space. VXLANs can go well beyond the limit of 4094 logical networks that can be set up with VLANs — with the 24-bit segment ID, 16 million layer-2 VXLAN networks can exist in a common layer-3 subnet. Similarly, TRILL replaces STP to allow much larger layer-2 domains as well as better link utilization; layer-2 links, which may have been ignored to avoid loops, may be used for better load distribution and thus bandwidth utilization.

By leveraging our research on TCP extensions, we demonstrate a new approach that decouples flow labels from IP addresses and enables the continued use of existing TCP connections following a migration. Unlike Salz's solution, our approach is backward compatible with legacy TCP stacks. (However, such stacks will not gain the benefit of the extended features.)

4.4.3 Challenges for VM Migration

Current approaches to VM migration are limited because they use IP addresses and ports to identify the TCP connection. However, IP addresses are meant to identify the network interface of the host. Therefore, overloading the use of an IP address to also identify a TCP connection binds the connection, for its lifetime, to that IP address alone. If the IP address were to change (for reasons of mobility or migration), the transport connection that uses the old IP address would break. This is because the interface would have a new IP address, and the transport connection labeled with the old IP address would no longer be valid. The hypervisor (or application) would have to set up a new connection with its peer, using a new socket, to continue communication.

This limitation does not allow seamless VM migration because migrating to a different subnet requires the acquisition of new IP addresses. This results in the termination of existing connections and requires the setup of new connections. As mentioned earlier, there are ad-hoc methods that allow continued use of old addresses. In contrast, we propose a clean and practical approach that eliminates the dependence on IP addresses for labeling transport connections.

With the isolation boundary [72], which enables rich extensions to TCP, we leverage the capability of decoupling transport endpoint identifiers from IP addresses and thereby address the issue that a change in the IP address breaks TCP connections. With the isolation boundary, we create a notion of an abstract flow, which we refer to as a transport-independent flow (TI flow) to emphasize that it is different from a TCP flow. The TI flow is identified by a transport-independent flow identifier (TIFID). As the TIFID is independent of the underlying network addresses, a change in the IP addresses does not invalidate the connection. Instead, the mapping of the TIFID to the (new) IP address is updated.

TIFIDs are exchanged as options during the connection setup phase; we refer to such options as isolation boundary options. A mapping between the abstract flow's sequence space and that of the TCP connection is created and synchronized at key stages during state transitions.

Figure 4.4 shows an illustration of how the isolation boundary enables the mapping of a TI flow to a TCP connection. It also identifies when synchronization of state occurs following a disruption.

4.4.4 Methodology

Here we discuss the required functionality and mechanics for enabling layer-3 migration, along with our associated prototype. We then present our experimental setup and case study for migration between subnets.

Required Functionality

There are three fundamental functionalities that we need to enable layer-3 migration:

1. The connection between the client and the service needs to be independent of the underlying transport. The notion of a logical flow enables layer-3 migration, as changes in IP address will not disrupt existing TCP connections.

2. Following a migration, a mechanism is needed for the VM to realize that the network configuration needs to change. This would happen when the VM is migrated and resumed. Ideally, this would be implemented as part of the hypervisor — perhaps as part of the virtual driver. (This may also be viewed as a need for cross-layer communication.)

3. Once the VM that hosts the server has migrated to a different subnet (with help from the hypervisor), the network interface gets a new IP address. The abstract-flow-to-TCP-connection mapping can be updated at the VM. However, the client would not be aware of the server's change of IP address, and thus there is a need to update the client's mapping of the abstract flow to the TCP connection. Once done, the client continues operation.

Mechanics

To understand the mechanics of layer-3 migration using the isolation boundary, we explain the processes involved before, during, and after the migration. Figure 4.6 illustrates the steps involved in the migration.

1. Connection Setup: The client contacts the server that is hosted in a VM. During connection setup (i.e., the three-way handshake), the isolation boundary options are exchanged.

2. Suspension: Upon the decision to migrate, the hypervisor hosting the VM suspends the VM's activity and records ephemeral state. The VM image and intermediate state are copied over to the destination.4

3. Resumption: Once the VM is migrated with its image and ephemeral state, it is resumed. At this point, the VM still has the old network configuration.

4. Synchronization: Once the VM migration has taken place, the server contacts the client with a SYN message. Here, the TCP SYN request carries the new IP address. The client recognizes the TI flow because of the accompanying TIFID in the SYN message and allows connection setup instead of replying with a reset or ignoring the request. The isolation boundary synchronizes the logical sequence space over the new TCP connection. In the scenario where the VM hosting the server migrates, which is discussed later in this section, the hypervisor of the VM triggers synchronization with the client. However, if the client moves, the client would trigger synchronization. If both move, additional bookkeeping is required.

5. Continued Operation: The application continues interacting with the service. Seamless migration from the perspective of the client and server is complete.

Implementation

To realize the functionalities identified above, we create a prototype with the following components: (1) isolation boundary, (2) link-status daemon, and (3) synchronization agent. (Other implementations are possible.)

4 This process may be optimized to achieve live migration [139].

Figure 4.6: An illustration of the steps involved and where they take place during a VM migration: (1) an ssh connection is set up from the client aramis to the ssh server in a VM at athos (192.0.2.228); (2) the VM is paused; (3) the VM is migrated to porthos (198.51.100.233); (4) the VM is resumed with a new IP address; and (5) the ssh connection is resumed.

A prototype implementation of the isolation boundary is available for the FreeBSD v8.1 kernel; some details of the implementation were discussed earlier in this chapter and more are to follow in the next section. These are also documented in our research papers [72,73].

Figure 4.7 presents the addition of isolation boundary options to TCP and their relation to its state transitions. Note that isolation boundary options are sent between the highlighted states.

To update the network configuration of a VM following a migration, we chose an expedient method: a daemon that monitors for changes in link status when the VM is resumed and generates an interrupt-like notification. To synchronize state information, an agent at the client listens for a new TCP connection with a complete TIFID. The synchronization updates the mapping of the TIFID to the TCP connection (with the new IP address) for both the service and the client.

As an artifact of our implementation, the stack in the client role listens on the same port through which it has an established connection. The listening port is available only for SYN messages with complete TIFIDs (not the partial TIFIDs typical of SYN segments establishing the initial connection) so that existing TI flows may be updated with a mapping to the new addresses. An ideal approach would be to implement such functionality as part of the hypervisor.

Figure 4.7: TCP State Transition Diagram with the addition of Isolation Boundary Options (i.e., TIFID_A, TISeq_A, TIAck_A). Isolation boundary options are sent between the highlighted states. The arrows indicate state transitions. The transitions are labeled with actions and message types (e.g., SYN, ACK). A transition may be labeled as <cmd>/<pkt> or <pkt>/<pkt>. For example, Send/SYN implies that a send command was received and a SYN message was sent, whereas SYN+ACK/ACK implies that a SYN+ACK message was received and an ACK message was sent. Successful delivery implies a transition to where the arrow leads. TIAck values are 0 when not in use (e.g., upon Active open, TIAck_B is 0). The dotted boxes indicate close commands (i.e., both passive and active).

Experimental Setup

To study correctness and establish a proof of concept, we set up FreeBSD v8.1 images with isolation-boundary-enabled kernels. We use VMware Player as the hypervisor and Ubuntu v12.04.1 LTS as the host OS. An SSH server acts as the service/application, and a client, aramis, connects over the network to the SSH server. We chose SSH as the application to validate the correctness of the TI-flow-to-TCP-connection mapping, both before and after the migration, as any misaligned or lost byte in the encrypted bitstream would break the SSH-over-TCP connection.

As illustrated in Figure 4.6, the SSH server is hosted in a VM deployed at the host athos in the subnet 192.0.2.0/24. The VM migrates to the host porthos, which is set up in the subnet 198.51.100.0/24. These subnets represent networks in different buildings on the Virginia Tech campus.

Demonstration

We study the scenario where we migrate a VM between buildings, which contain different subnets. In this setup, the client aramis connects to the VM hosted at athos over SSH and executes different jobs for test purposes. The VM at athos is then suspended.

Implementing live migration requires the involvement of the hypervisor, which we chose not to modify for our prototype demonstration, as we used a proprietary hypervisor to demonstrate the general applicability of the approach. Instead, to emulate live migration, we copy the static image hosted at athos to porthos in advance. After suspension, we use rsync to copy the intermediate state saved by the hypervisor.

Following the transfer of intermediate state, the VM is resumed at porthos. Upon resumption, the link-state daemon notices the change in link state and reconfigures the network interface.

The daemon is set up to generate a DHCP request to acquire network configuration parameters whenever the interface comes up. Subsequently, the synchronization agent takes action to synchronize the existing TI flow by setting up a TCP connection with the client using the new IP address.

After synchronization, the client at aramis continues interaction with the server (now at porthos), oblivious to the fact that the server has migrated to a different subnet. Thus, SSH connectivity is not interrupted.

4.4.5 Discussion & Evaluation

Here we compare and contrast our proposed approach with existing solutions and discuss our case study.

Layer-3 vs. Layer-2 Migration

Whether migrating VMs within a data center or between data centers, the challenge remains the same. In either case, the possibility of layer-3 migration adds to the design flexibility available to system and network architects, as it allows for a clean separation of concerns. Such flexibility applies to both communication scenarios: when communication originates from within the data center and when it originates from outside.

Whether we use tunneling [138], routing [140], or layer-2 expansion [142–144], the intention is to convert the problem of migrating VMs between subnets into the problem of migrating within a subnet. In other words, the problem of layer-3 migration is converted into a problem of layer-2 migration. As discussed earlier, layer-2 expansion proposals are not meant to address VM migration. Therefore, by using a method that was developed for a different purpose, we are not addressing the problem that limits migration beyond the subnet; instead, we are working around it. VXLANs [142] are aimed at expanding the VLAN address space and creating an overlay network over layer-3 by encapsulating the entire layer-2 frame. Similarly, with TRILL [147], the intention is to replace STP to improve resource utilization by using links that were ignored by STP when avoiding loops. Note that such cases introduce requirements such as the need to increase MTU sizes to accommodate outer headers for VXLAN and the deployment of new hardware for both VXLANs and TRILL.

Such an approach constrains the design of the network as well. As we discussed earlier, partitioning the network into subnets adds to the design flexibility. Therefore, if transport connections are independent of network labels, we would have the best of both worlds: seamless live migration beyond networks as well as flexibility in the design of networks.

With the isolation boundary, we tackle the source of the problem, which is the coupling of naming abstractions. In effect, the solution does not require dealing with layer-2, but instead enables layer-3 migration. TIFIDs present a label that is independent of the network address. Subsequently, a change in the network address has no impact on connectivity.

Downtime and Latency

Minimizing the time for live migration is desirable. Automating the process of configuring forwarding tables using OpenFlow incurs a negligible increase in the downtime for a live migration. Similarly, with IP tunneling, setting up the tunnel incurs minimal overhead, which may be reduced further by optimizations. Layer-2 expansion methods (i.e., VXLAN, TRILL, VPLS/MPLS) also do not incur any more downtime than what is necessary to transfer intermediate state.

In these cases, if all communication destined for a service hosted by the migrated VM originates within the subnet, then the downtime may not be more than what is necessary for the transfer of intermediate state. However, if communication happens with elements outside the subnet, then migrating a VM to a geographically disparate location requires traffic to be routed through the source subnet. This would incur additional latency for the existing network connections (whose state was maintained following the migration). It is not feasible to advertise new routes to the outside world for the portion of the subnet that has been migrated to the new location.

With a hypervisor implementing the isolation boundary, the network interface would need to acquire a new IP address for the migrated VM. Optimizations may be applied to minimize this overhead; for example, as part of live migration, the hypervisor may acquire an address for the interface before the intermediate state is transferred to the destination subnet. Unless such optimizations are applied, the time required to acquire a new IP address (e.g., with a DHCP request) would be greater than that of an RARP request to update the IP-to-MAC address mapping.

On the other hand, the isolation boundary adds no latency between the client and the migrated server. There is no increase in latency because the new IP address assigned to the network interface is owned by the destination subnet. Thus, communication is direct, unlike the other methods, where traffic from the client is routed through the source subnet before it gets to the server in the destination subnet. This is what we demonstrate in the case study.

VM’s Coupling with Previous State (/Subnet)

As noted above, tunneling, routing, and layer-2 expansion methods may be applied to extend a subnet to a different geographical location, but these methods create unnecessary coupling between the source and destination subnets. Such solutions do not work well where disaster recovery or failover management is the concern (i.e., when migrating VMs between sites).

If VMs are to be migrated to a different site for disaster management, we cannot assume that the forwarding elements at the source subnet would continue to assist after migration is complete.

With tunneling methods, communication from outside the subnet continues to arrive at the source subnet, which is then forwarded through the tunnel to the destination subnet. Herein lies the assumption that the source subnet would continue to assist even after the migration is complete.

There may be some optimizations, where the migrated VM's network interface card (NIC) is assigned a new IP address while it maintains its old IP address for as long as the old TCP connections are active. However, if the source subnet were unable to assist, such a solution would not work, at least for the network connections set up before migration.

Similarly, if an OpenFlow-based approach to redirecting the traffic is used, the controller configures the forwarding elements so that traffic is sent to the destination subnet. While the VM may be hosted in the destination subnet, the forwarding element at the source subnet continues to participate in the communication. Such behavior would not be acceptable when dealing with VM migration for disaster management. The same is the case with layer-2 expansion methods.

Due to the use of the isolation boundary, there is no requirement that the source subnet participate after the migration. This is because the service, after migration, uses an IP address that belongs to the subnet to which the VM migrated.

Correctness

Using SSH enables us to validate the correctness of the migration. With SSH, any misalignment of bytes, lost segments, or incorrect ordering would break the encrypted stream. As we are able to successfully use the SSH client application following a migration, we establish the fundamental correctness of the method and implementation. Unfettered network access is enabled, even after a change of IP address, because we were able to decouple the IP address (network label) from the socket (transport label).

As seen in Figures 4.8, 4.9, and 4.10, the network configuration changes such that the VM is connected via different subnets before and after migration. The transport connection state also shows that, after migration, the old connection does not exist and a different port at the server is used to interact with the same port that was used earlier on the client — showing that the logical connection has been resumed over a new port at the server. We see the same configuration from the client’s perspective; the local setup remains the same, but the server port changes for the same TCP connection after migration.

In spite of these changes, the application continues to operate without a hitch. Indeed, SSH continues to show the old address. In Figure 4.11, we see that the SSH_CONNECTION environment variable is set when the connection was set up, but the application remains oblivious to the subsequent change in network configuration. It appears that the SSH application does not make use of the information stored in the environment variable.

Pause-and-Copy Migration

Interruption in connectivity can also be a concern for pause-and-copy migrations. This is because, if the VM migrates to a different subnet and is required to acquire new IP addresses, the existing connections that were paused would be discontinued. Solutions such as Dynamic DNS [148] that update the domain-name-to-IP mapping cannot help, as the services are paused, not stopped, for migration. Therefore, pause-and-copy migrations effectively involve the same procedures; it is just that the time scales at which they happen are much larger as compared to live migration.

em0, before migration:
    ether 08:00:27:49:75:00
    inet 192.0.2.228 broadcast 192.0.2.255
em0, after migration:
    ether 08:00:27:49:75:00
    inet 198.51.100.233 broadcast 198.51.100.255
Figure 4.8: Network configuration at the server, before and after VM migration.

connection state before migration:
Proto  Local Address    Foreign Address   (state)
tcp4   e4.dhcp.v.ssh    d8.dhcp.v.41270   ESTAB.
tcp4   *.ssh            *.*               LISTEN
connection state after migration:
Proto  Local Address    Foreign Address   (state)
tcp4   e9.dhcp.v.48472  d8.dhcp.v.41270   ESTAB.
tcp4   *.ssh            *.*               LISTEN
Figure 4.9: TCP connection state at the server, before and after migration.

connection state before migration:
Proto  Local Address    Foreign Address   (state)
tcp4   aramis.41270     *.*               LISTEN
tcp4   aramis.41270     e4.dhcp.v.ssh     ESTAB.
connection state after migration:
Proto  Local Address    Foreign Address   (state)
tcp4   aramis.41270     *.*               LISTEN
tcp4   aramis.41270     e9.dhcp.v.48472   ESTAB.
Figure 4.10: TCP connection state at the client, before and after migration. Note that the listening socket on the same port is an artifact of our implementation.

application state before migration:
> echo $SSH_CONNECTION
192.0.2.216 41270 192.0.2.228 22
application state after migration:
> echo $SSH_CONNECTION
192.0.2.216 41270 192.0.2.228 22
Figure 4.11: Application state at the server, when logged in from the client, before and after VM migration. Note that the environment variable is set at the time of connection setup and is oblivious to the change in configuration.

As we demonstrate in the case study, by using the isolation boundary, we can avoid all issues of coupling between transport and network labels and thereby enable a seamless pause-and-copy migration.

Backward Compatibility

If a client does not support the isolation boundary, it can still interact with the VM initially, as the isolation boundary extension is backward compatible. However, if the VM migrates in such a case, the client will not be able to resume connectivity with the VM, as it will not recognize the transport connection with a new IP address.

Deployment

As the isolation boundary is backward compatible, there is no requirement that all participating network elements be aware of the functionality. Network stacks that implement the isolation boundary would benefit from the functionality; all others would fall back to legacy support.

However, entire subnets can benefit from the features by deploying isolation-boundary-aware gateways (e.g., NATs, load balancers). The nodes would only benefit from the support provided by the isolation-boundary-aware gateway during the time they remain within the scope of the gateway. Nevertheless, such gateways may facilitate incremental adoption.

Compatibility with Live Migration

Our proposed approach does not make any assumptions about the procedures involved in migration. There is no expectation from either the service hosted at the VM or the client application.

This allows live migration procedures to exist and operate independent of how the isolation boundary operates. To these processes, the isolation boundary is an extension. For our case study, we did not consider a live migration. However, the case of saving the VM's ephemeral state and resuming it after migration such that network connections remain valid involves the same technical challenges (process migration, storage management, etc.), except for the duration of time the VM is inactive. Though live migration is not the focus of our study, with the straightforward optimization of copying the VM image in advance and then using rsync to copy the intermediate state, we were able to reduce the inactive time to tens of seconds. Adopting an implementation approach similar to Clark et al. [139] would reduce inactive times to milliseconds.

Middleboxes and TCP Options

Honda et al. [67] state that middleboxes today tend to either strip custom TCP options if they are part of the data stream or drop the packets altogether. However, if the custom options are used during the connection-setup phase alone, then middleboxes tend to allow most of the traffic through. This finding is favorable to our approach, where the isolation boundary options are exchanged only during the connection-setup phase (i.e., the 3-way handshake).

With our case study, we validate the hypothesis that if the isolation boundary options are exchanged successfully, we will be able to enable uninterrupted communication following a VM migration beyond a network. However, if the options are removed, the network stack would fall back to legacy behavior, and communication would take place until the VM is migrated to a different subnet. At this point, communication would stop, and a new connection would have to be set up between the client and the server, with the server's new address. Unless the applications implement reliability, such a discontinuation may require the application to reset.

Network Performance & Scalability

As the isolation boundary options only participate during connection setup or synchronization of state, and not during the rest of the communication, there is no impact on performance. Further evaluation of the performance of the isolation boundary is presented in later sections of this chapter.

For the network stack to scale in terms of managing a large number of connections, the implementation needs to be thread safe. The fact that the proposed method is only active during the 3-way handshake — for connection setup or synchronization — significantly reduces the demand for locks, which could otherwise inhibit performance.

Security Considerations

Our proposal does not introduce any security threat greater than that to which TCP is already exposed. A flow can only be hijacked if the TIFIDs can be guessed correctly along with the sequence numbers. As we use the same methods for initializing TI sequence numbers as are used for TCP's sequence numbers, we do not introduce any risk greater than TCP's. The response to an invalid request to synchronize a TI flow over a new TCP connection, i.e., one with an invalid TIFID or TIAck, is a reset.

4.4.6 Summary of Case Study

Here, we highlight that the migration of a virtual machine beyond a subnet is of significant importance. Until now, this goal has only been realized through ad-hoc means.

We make the case that the use of IP addresses as part of flow labels — to identify the transport connection — overloads the notion of network naming and that this is the fundamental reason that inhibits a clean approach to VM migration beyond a subnet. We suggest that decoupling the transport endpoint naming from IP addresses is not only possible, but is also efficient. We establish this claim by demonstrating seamless VM migrations between different subnets such that the applications are oblivious to the migration.

4.5 Case Study Exemplifying TCP Extensions: Resilience in the Presence of Middleboxes

A guiding principle in the design of the Internet has been that network communication is end-to-end and that network intelligence should be as close to the resources on the edge as possible [68]. Because of this, the capacity of the Internet has scaled well with the rapid growth in the number of devices. As the Internet has grown and matured, however, it has been necessary to introduce intelligent intermediate devices, such as firewalls or application gateways, hence weakening the end-to-end nature of the network. Although these intermediate devices weaken the notion of end-to-end connections, they are necessary for operational or functional reasons. Nevertheless, intelligent intermediate devices reduce the network transparency for end hosts and their applications, requiring the end hosts to make decisions that accommodate the intermediate devices. With the proliferation of intelligent devices in the network, the likelihood that communication will be interrupted has increased.

When we refer to middleboxes here, we are referring specifically to devices that record TCP state and use that state to affect future decisions. Many of the middleboxes mentioned in RFC 3234 [44] fall into this category. When a middlebox fails, it loses the state it maintained on behalf of ongoing communication. Upon restoration of the middlebox, established TCP connections have no corresponding state and are rejected. Even if the fault is transient, the applications on the end host have no recourse but to try to re-establish communication or to pass the problem on to the user. Neither approach is entirely satisfactory. In the former case, each application needs to be written to handle middlebox failure. While burdensome, it is preferable to the latter case, where responsibility for handling the failure is passed to the user, who is not even aware of the existence of middleboxes.

While restarting applications causes an annoyance to some, for some industries this represents significant lost opportunity. The banking industry in particular must balance the concerns of resilience with security. As an element of security, banks deploy network firewalls. However, to protect against equipment failure, firewalls are deployed in state-sharing pairs. While possible when the firewalls are co-located, this does not protect against building or power failures. In this case, physically diverse pairs are deployed, but these cannot share state due to the distance and reliability of the networks between them. Compounding this problem is the fact that each entity has its own set of firewalls, and application communication may involve multiple entities. Hence, an application is vulnerable to any firewall pair within its communication context failing. Significant revenue can be lost during the time taken to restart applications. But a greater risk to revenue is that clients move "to competitors at the first sign that [the] company's infrastructure [is] down" [149].

Here, we examine the problem of recovering from transient middlebox failures and, to the greatest extent possible, insulating the application (and thereby the user) from these failures. This problem belongs to the general class of problems of providing higher-level end-to-end network services. We refer the reader to Section 4.5.4 for examples.

We evaluate our solution primarily on backward compatibility with legacy TCP and the principle that the protocol should increase functionality without decreasing performance. Users should not perceive any change in performance, either in the connection-establishment phase or during data transfer. We do not consider the impact of recovery from a transient middlebox failure on overall data-transfer performance, as this condition is an exception to normal operation. However, we do examine the minimum time required to recover a broken connection.

Protocols already exist that would allow us to restart a TCP connection [150]. But instead of limiting our solution to a few problems, we introduce a mechanism discussed in earlier sections, referred to as the isolation boundary, that places a TCP connection in the context of a transport-independent flow (TI-flow). This mechanism decouples the abstract flow from the underlying TCP, thereby making the solution applicable to other higher-level end-to-end network services. However, in this work we only examine the case of recovering from transient middlebox failures. The isolation boundary keeps track of where TCP is in the context of the TI-flow so that a new TCP connection can be created and communication restored in the event of a transient failure without the application — or, more importantly, the user — becoming aware. Most important of all, our approach maintains backward compatibility with existing devices, thus allowing incremental adoption.

4.5.1 Conceptual Design

It is well established that middleboxes today are an "Internet fact of life" [63]; nevertheless, it is also accepted that they break the end-to-end semantics assumed by typical applications. This is because the middleboxes maintain state and interact in the conversation, often transparently to the end systems — some middleboxes terminate TCP transport flows from the sender and set up a fresh transport connection with the destination, all the while acting as a relay while performing their job.

Such behavior introduces the challenge that, should the middlebox fail, we lose connectivity between the end hosts, which is not acceptable, since we lose the notion of end-to-end semantics assumed by the end hosts. While we acknowledge that middleboxes provide some benefits, the liability is that when they fail, connections being carried through them are also broken.

As we have explained in earlier sections, we address this challenge by developing the notion of an abstract flow, specifically a transport-independent flow (TI-flow) that represents the abstract communication between applications, independent of the underlying transport protocols. For this purpose, we use the isolation boundary, which allows us to maintain end-to-end semantics at an abstract level, without precluding middleboxes. To create an abstraction of TCP we must maintain TCP’s semantics of reliable, in-order delivery of data and we must maintain an identity independent of the addressing that TCP uses to identify a connection. We call the independent identifier a transport-independent flow identifier (TIFID).

In order to be backward compatible, the TI-flow capability must be negotiated out-of-band from TCP’s data stream. To have the least performance impact, maintenance of the TI-flow capability should also happen out-of-band from TCP’s data stream. We use TCP options to establish and maintain the placement of a TCP connection within the context of the TI-flow.

The isolation boundary leverages support from TCP by delegating the tasks of reliable, in-order delivery and the description of the sequence space. However, to implicitly maintain the end-to-end semantics for the flow abstraction, we define an abstract sequence space.5 The data in each TCP connection is then mapped into this sequence space as a part of placing the TCP connection in the context of the transport-independent flow. An implementation could reuse TCP's sequence space directly; however, this would create additional coupling to TCP, limiting the TI-flow as a general mechanism. Additionally, the reuse of state between TCP connections would make TCP more vulnerable to hijacking. All that is required is to describe how each TCP connection fits into the overall context of the TI-flow. It is sufficient to establish this mapping during TCP's synchronization phase. Once the mapping is established, progress through TCP's sequence space implies progress through the transport-independent sequence space. Note that the mechanism of defining the TI-flow does not imply maintaining distributed state. Since we delegate the task of maintaining distributed state to TCP and only synchronize the abstraction with TCP's semantics at the time of setup, the overhead during setup is expected to be negligible.

5 We assume that a conversation between endpoints may last longer than the life of the TCP flow.

Leveraging TCP options and keeping pace with TCP's sequence space required us to implement the isolation boundary in the kernel, e.g., FreeBSD. With a user-library implementation, we would have needed to develop a mechanism to probe whether the communicating peer had support for such flow abstractions. Any probe protocol would ultimately have to resort to timeouts to infer the lack of support. Unlike the user-library implementation, the presence of custom options in the SYN+ACK message would indicate support, while absence would indicate a lack of support, thus eliminating the need for any heuristics. The discovery of end-to-end support can happen alongside connection setup, thus allowing the kernel implementation to avoid probing and be backward compatible at the same time. Since the isolation boundary plays its role only when a new transport flow is set up and does not interfere with the critical data path, the overhead during transport is expected to be negligible.

Practical Details

With the logical construct defined above, we now study the details of creating an implementation. A critical aspect that must be considered for a practical implementation is the amount of data required to convey the context of the TI-flow in the TCP option field. Another important aspect is the question of security. The introduction of the flow option should not compromise TCP's security characteristics.

TCP options may be used during the TCP SYN phase to reliably exchange a unique flow identifier, the TIFID, and the mapping between the transport-independent sequence space and TCP's sequence space. We discuss the protocol for selecting the flow identifier and populating the transport-independent sequence numbers in later sections.

Before we further discuss the practical details, we need to acknowledge the constraints of using custom TCP options to exchange the transport-independent flow identifier and the transport-independent sequence numbers, e.g., space availability in the TCP header. When the TCP SYN flag is set, the following options need to be supported: maximum segment size (RFC 793, four octets) [1], window scaling (RFC 1323, three octets) [126], selective acknowledgment permitted (RFC 2018, two octets) [127], and timestamp (RFC 1323, ten octets) [126]. This leaves us with 21 octets, although most implementations will only leave 20 octets due to field alignment. Because of this limited TCP option space, not all options can be supported simultaneously. (We have discussed these unsupported options earlier in the chapter.)

The size of the transport-independent sequence space should be at least as big as TCP's sequence space (32 bits). Any smaller would create problems in the mapping between the two spaces during TCP's synchronization phase. Having more space allows the issue of TCP sequence-space wrap-around over high-speed links to be addressed as a future concern. Larger spaces for both the transport-independent sequence space and the TIFID will decrease the vulnerability of the TI-flow to session hijacking. Because the upper bound on sizes is dictated by the remaining space for TCP options, we chose the upper bound for each field, i.e., TIFID, sequence number, and acknowledgment number, to be 48 bits each, for a total of 18 bytes. Other than to note these bounds on field sizes, we do not explore the problem of ideal size in any more depth in this work.
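As a sanity check on this arithmetic, the option-space budget can be worked out in a few lines. The option sizes come from the RFCs cited above; the flow-option layout (one kind octet, one length octet, three 48-bit fields) is the format proposed in this section, not an established standard.

```python
# Worked example of the TCP option-space budget discussed above.
TCP_OPTION_SPACE = 40  # maximum option octets in a TCP header

# Options that must coexist with the flow option in a SYN segment.
syn_options = {
    "mss": 4,         # maximum segment size, RFC 793
    "wscale": 3,      # window scaling, RFC 1323
    "sackOK": 2,      # SACK permitted, RFC 2018
    "timestamp": 10,  # timestamps, RFC 1323
}

remaining = TCP_OPTION_SPACE - sum(syn_options.values())  # 21 octets
usable = remaining - 1  # most stacks pad to 32-bit alignment: 20 octets

# Proposed flow option: kind (1) + length (1) + three 48-bit fields (18).
FIELD_BITS = 48
flow_option_len = 1 + 1 + 3 * (FIELD_BITS // 8)

assert flow_option_len <= usable  # the option just fits
```

Note that the three 48-bit fields consume the usable space exactly, which is consistent with the choice of 48 bits as the upper bound per field.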

4.5.2 Extending TCP

The TCP header, shown in Figure 4.5, consists of 20 octets for fields that must be present in all TCP headers, followed by up to 40 octets of options. We extend TCP by creating a transport-independent flow option, which is only valid during connection setup. The option consists of three 48-bit fields, as shown, in addition to the option tag and the length.

The first field contains the transport-independent flow identifier (TIFID), which provides the layer of indirection needed for the isolation boundary by labeling a flow independent of the underlying transport. The TIFID is an opaque identifier that is unique within the context of the two end hosts. A straightforward way to specify a TIFID with the correct properties is for the requesting process to specify a locally unique value for the first half of the TIFID in the initial SYN packet (TIFID1 through TIFID3) and the responding process to specify a locally unique value for the second half in the SYN+ACK packet (TIFID4 through TIFID6). As with TCP initial sequence numbers, both halves of the TIFID should be selected at random to guard against connection hijacking. Finally, the second half of the TIFID is zero during the time that the TIFID is partially specified, i.e., in the SYN packet.
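The half-and-half construction described above can be sketched as follows. This is an illustrative model, not the kernel implementation; the redraw-on-zero guard is our own assumption, added so that a completed TIFID is always distinguishable from a partially specified one.

```python
import secrets

HALF_BITS = 24  # each endpoint contributes three of the six TIFID octets
LOW_MASK = (1 << HALF_BITS) - 1

def random_half():
    """Random nonzero 24-bit half (zero is reserved to mean 'unspecified')."""
    v = 0
    while v == 0:  # assumption: redraw so a completed TIFID never looks partial
        v = secrets.randbits(HALF_BITS)
    return v

def initiator_tifid():
    # SYN: high half (TIFID1..TIFID3) random, low half zero (partial)
    return random_half() << HALF_BITS

def complete_tifid(partial):
    # SYN+ACK: responder fills in the low half (TIFID4..TIFID6)
    assert partial & LOW_MASK == 0, "TIFID already completed"
    return partial | random_half()

def is_partial(tifid):
    # a zero second half signals a TIFID still awaiting the responder's half
    return (tifid & LOW_MASK) == 0
```

This matches the traces later in the section, where the SYN carries a TIFID whose low three octets are zero and the SYN+ACK carries the completed value.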

By itself, the TIFID is insufficient to allow resynchronization when the underlying transport fails. The missing information is the position within the transport-independent flow. Thus, there are two additional fields in the flow option indicating the next byte to be sent, i.e., the transport-independent sequence number (TISeq), and the last byte received, i.e., the transport-independent acknowledgment number (TIAck). As with traditional TCP, the two end points select an initial TISeq during the three-way handshake, and each returns a TIAck to acknowledge the receipt of a SYN packet. Unlike TCP, the SYN bit does not need to be acknowledged because that is TCP's responsibility. When not defined, such as during the first phase of the three-way handshake, a TIAck is zero.

The TISeq are mapped onto the protocol-dependent sequence numbers of the underlying (TCP) transport and remain synchronized with them as long as the transport connection is active. The TISeq progresses through its space in a manner that is consistent with the transport sequence number being incremented. Because of this implicit synchronization, there is no need to explicitly send the TISeq and TIAck numbers after the three-way handshake.
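One way to model this mapping is as a fixed offset between the two spaces, established once per TCP connection during the handshake. The exact arithmetic is not prescribed here, so the following is a hypothetical sketch under that offset assumption.

```python
SEQ32 = 1 << 32  # TCP's 32-bit sequence space
SEQ48 = 1 << 48  # the 48-bit transport-independent sequence space

class TiMapping:
    """Per-connection mapping from TCP sequence numbers to TISeq values,
    fixed during the three-way handshake (illustrative model)."""

    def __init__(self, tcp_isn, initial_tiseq):
        self.tcp_isn = tcp_isn % SEQ32
        self.initial_tiseq = initial_tiseq % SEQ48

    def tiseq(self, tcp_seq):
        # Progress through TCP's space implies equal progress through the
        # TI space; the modulo handles TCP's 32-bit wrap-around, assuming
        # fewer than 2^32 bytes are outstanding at once.
        delta = (tcp_seq - self.tcp_isn) % SEQ32
        return (self.initial_tiseq + delta) % SEQ48
```

Because the offset is fixed, no TISeq or TIAck values need to be carried after the handshake; the current TISeq is always recoverable from the current TCP sequence number.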

Connection Establishment

Figure 4.3 shows a sequence diagram of connection establishment. The initiator of communication, PeerA, defines the first half of the flow identifier, TIFIDA, and initializes the second half to zero. PeerA also selects a random initial TISeq number, TISeqA, and establishes a mapping between TISeqA and the initial TCP sequence number. (TIAckA is set to zero.) It then sends a SYN packet with a flow option containing these values. PeerB defines the second half of the TIFID, TIFIDB, using a random value and selects a random initial TISeq number, TISeqB. It acknowledges receipt of the SYN packet by setting TIAckB = TISeqA. It then sends a SYN+ACK packet with a flow option containing these values. Upon receipt of the reply, PeerA notes the value of the completed TIFID, which uniquely identifies the flow. It then returns an ACK packet containing the completed TIFID, its established TISeqA, and a TIAckA = TISeqB acknowledging the SYN+ACK packet as the final phase of the three-way handshake. Finally, PeerB validates that its SYN packet was received by checking TIAckA. At this point, transport-independent flows in each direction have been established, along with the associated bidirectional TCP connections.
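This exchange can be modeled as three functions, one per handshake message. Peer state is held in plain dictionaries; all names are illustrative, and the redraw-on-zero guard for TIFID halves is omitted for brevity.

```python
import secrets

HALF = 24  # bits per TIFID half

def syn(peer_a):
    # PeerA: partial TIFID (high half only), random TISeqA, TIAck = 0
    peer_a["tifid"] = secrets.randbits(HALF) << HALF
    peer_a["tiseq"] = secrets.randbits(48)
    return {"tifid": peer_a["tifid"], "tiseq": peer_a["tiseq"], "tiack": 0}

def syn_ack(peer_b, opt):
    # PeerB: complete the TIFID, pick TISeqB, set TIAckB = TISeqA
    peer_b["tifid"] = opt["tifid"] | secrets.randbits(HALF)
    peer_b["tiseq"] = secrets.randbits(48)
    return {"tifid": peer_b["tifid"], "tiseq": peer_b["tiseq"],
            "tiack": opt["tiseq"]}

def ack(peer_a, opt):
    # PeerA: adopt the completed TIFID, set TIAckA = TISeqB
    peer_a["tifid"] = opt["tifid"]
    return {"tifid": peer_a["tifid"], "tiseq": peer_a["tiseq"],
            "tiack": opt["tiseq"]}

a, b = {}, {}
m1 = syn(a)
m2 = syn_ack(b, m1)
m3 = ack(a, m2)
assert m2["tiack"] == m1["tiseq"]  # SYN+ACK acknowledges TISeqA
assert m3["tiack"] == m2["tiseq"]  # final ACK acknowledges TISeqB
assert a["tifid"] == b["tifid"]    # both ends hold the completed TIFID
```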

Connection Re-establishment

The failure of a middlebox in the network causes the logical end-to-end connections passing through it to also fail. Even though the connection failed, the isolation boundary maintains the position within the application data streams in each direction, via TISeq and TIAck, so that the transport can resume in the correct place once a new TCP connection is established.

The procedure for resuming operation after a disconnection is the same as for creating a new connection, except that the previously completed TIFID, signified by a second half not equal to zero, is used instead. Since the TIFID is already complete, the receiving stack looks up the isolation boundary information corresponding to the complete TIFID and creates a new TCP connection upon which to resume the sending of application data. The exchange of SYN and SYN+ACK packets in this case allows the stacks to re-synchronize where they left off at the time of the disconnection by exchanging the TISeq and TIAck numbers. The peers also use the TISeq and TIAck numbers to establish new mappings from the old transport-independent sequence space to the new transport-dependent sequence space.
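The receiver-side decision between fresh establishment, resumption, and rejection can be sketched as follows. This is a simplified model: the real logic lives in the kernel's TCP input path, and the `flows` dictionary stands in for the isolation boundary's per-flow state table.

```python
LOW_MASK = (1 << 24) - 1  # low half of the 48-bit TIFID

flows = {}  # completed TIFID -> surviving TI-flow state (illustrative)

def register(tifid, tiseq, tiack):
    """Record the TI positions that survive the loss of a TCP connection."""
    flows[tifid] = {"tiseq": tiseq, "tiack": tiack}

def handle_syn(tifid, peer_tiseq):
    """Receiver's view of an incoming SYN carrying a flow option."""
    if tifid & LOW_MASK == 0:
        return "new-flow"   # partial TIFID: ordinary connection establishment
    flow = flows.get(tifid)
    if flow is None:
        return "reset"      # unknown completed TIFID: answer with a reset
    # Known flow: resynchronize against the peer's surviving position and
    # map it onto the new TCP connection's sequence space.
    flow["tiack"] = peer_tiseq
    return "resumed"
```

The "reset" branch corresponds to the security behavior noted earlier: an invalid request to synchronize a TI flow over a new TCP connection is refused.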

Backward Compatibility

End hosts advertise that they implement the isolation boundary by specifying the flow option in a TCP header. If both hosts specify the flow option, then the functionality of an isolation boundary is enabled. If either host is unable to support the isolation boundary for any reason, it will not supply the flow option during connection establishment, and hence, both will continue to connect without the isolation boundary, thus maintaining backward compatibility.

Even when some overzealous middleboxes strip off unknown TCP options, compatibility is still maintained, because hosts that do not implement the isolation boundary will behave the same as before, while hosts that do implement the isolation boundary will be led to believe that the other host does not and hence fall back to legacy behavior.

In this way, backward compatibility is maintained, and there is no requirement that all hosts be updated simultaneously.
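The negotiation outcome reduces to a small predicate over who supports the option and whether a middlebox removed it in flight. As a simplifying assumption, this sketch treats a stripping middlebox as affecting both directions of the handshake.

```python
def negotiate(sender_supports, receiver_supports, middlebox_strips=False):
    """True iff the isolation boundary ends up enabled end to end (sketch)."""
    # The SYN carries the flow option only if the sender supports it and
    # no middlebox removed it in flight.
    syn_has_option = sender_supports and not middlebox_strips
    # The SYN+ACK echoes a flow option only if the receiver saw one,
    # supports the feature, and the echo also survives the middlebox.
    synack_has_option = (syn_has_option and receiver_supports
                         and not middlebox_strips)
    # The sender enables the boundary only on seeing the echoed option;
    # otherwise both sides silently fall back to legacy TCP.
    return synack_has_option
```

Any single missing piece, an unaware peer or a stripped option, yields legacy behavior rather than a failure, which is what makes incremental deployment possible.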

4.5.3 Implementation and Evaluation

We implemented the isolation boundary in the FreeBSD 8.1 kernel. The kernel implementation for the isolation boundary was contributed by Eric J. Brown, Virginia Tech. A summary of the changes follows: 1,156 lines added in 55 locations, 58 lines deleted in 34 locations, and 435 lines modified in 42 locations. A total of 1,649 out of 237,410 (0.7%) lines were touched in the network stack, representing 131 locations in 12 out of 122 files (9.8%).

Our test environment consists of three Dell PE2650 servers running FreeBSD 8.1. The servers each have dual Intel Xeon SMT processors with a frequency of at least 2.0 GHz and hyperthreading turned on. They also have 4 GB of DDR2 RAM and a bus speed of 533 MHz, and are connected by 1 Gbps Ethernet. The average throughput measured with iperf is 940 Mbps for the legacy TCP stack. Two of the servers are configured as a client and a server. The third is configured as a WAN emulator using Dummynet [103].

We now turn our attention to evaluating the backward compatibility, correctness, and performance of the implementation.

Backward Compatibility and Correctness

There are two cases to consider in ensuring backward compatibility with legacy TCP: a modified sender connecting to an unmodified receiver, and an unmodified sender connecting with a modified receiver. We use SSH as an example and show traces of connection establishment using tcpdump, which has been modified to display the new TCP option.

Trace 1 shows the three-way handshake during connection establishment between a sender that wants to establish an isolation boundary and a receiver that does not. The flow option is displayed in bold face. The SYN packet uses the flow option to convey a partial TIFID and an initial TISeq. The TIAck is zero, as there is nothing to acknowledge yet. The SYN+ACK packet does not contain a flow option, since the receiver does not implement the isolation boundary or is unable to set one up at this time. As a result, the sender does not set the flow option in the ACK packet, and both hosts communicate without an isolation boundary.

Trace 1: Sender Implements the Isolation Layer
Packet 1: IP 192.168.1.2.4874 > 192.168.2.4.ssh: Flags [S], seq 100, win 65535,
    options [mss 1460, nop, wscale 3, sackOK, TS val 787 ecr 0,
    flow-d tifid 4b2209000000 tiseq 00000000001f tiack 000000000000], len 0
Packet 2: IP 192.168.2.4.ssh > 192.168.1.2.4874: Flags [S.], seq 200, ack 101, win 65535,
    options [mss 1460, nop, wscale 3, sackOK, TS val 197 ecr 787], len 0
Packet 3: IP 192.168.1.2.4874 > 192.168.2.4.ssh: Flags [.], ack 1, win 8326,
    options [nop, nop, TS val 788 ecr 197], len 0

In the case of an unmodified sender talking to a modified receiver, the receiver is made aware that the sender does not implement (or has decided not to set up) the isolation boundary when it receives the SYN packet without the flow option being set. Therefore, it does not set the flow option in the SYN+ACK packet it sends, and the connection proceeds without the isolation boundary. We omit the trace for this case as it is exactly the same as legacy TCP.

Now that backward compatibility has been demonstrated, we show the case where both the sender and the receiver implement the isolation boundary. As Trace 2 shows, the sender sets the flow option in the SYN packet. The receiver replies with a SYN+ACK packet containing a flow option with a complete TIFID, its initial TISeq, and the appropriate TIAck.

Trace 2 Both Implement the Isolation Layer

Packet 1: IP 192.168.1.2.11305 > 192.168.2.4.ssh: Flags [S], seq 110, win 65535,
    options [mss 1460, nop, wscale 3, sackOK, TS val 109 ecr 0,
    flow-d tifid 0a4530000000 tiseq 00000000001d tiack 000000000000], len 0
Packet 2: IP 192.168.2.4.ssh > 192.168.1.2.11305: Flags [S.], seq 456, ack 111, win 65535,
    options [mss 1460, nop, wscale 3, sackOK, TS val 138 ecr 109,
    flow-d tifid 0a4530be79bf tiseq 00000000001f tiack 00000000001d], len 0
Packet 3: IP 192.168.1.2.11305 > 192.168.2.4.ssh: Flags [.], ack 1, win 8326,
    options [nop, nop, TS val 110 ecr 138,
    flow-d tifid 0a4530be79bf tiseq 00000000001d tiack 00000000001], len 0

Upon receipt of the reply, the sender knows that the receiver wants to utilize an isolation boundary, so it sends an ACK packet with the flow option filled out to confirm that it received the option correctly. This protects the isolation-boundary capability in the presence of a middlebox that strips options in one direction but not the other. The connection proceeds utilizing the isolation boundary. The TCP flow option is not included in any other packets of the connection.

Overhead Incurred by the Isolation Boundary

Because the flow option added to TCP to support the isolation boundary is only transmitted on the wire during connection setup, i.e., during the three-way handshake, we expect any additional overhead would be most observable as an increase in setup time during that phase.

(During the operation of the connection, there is a small amount of processing needed to keep the TISeq and TIAck in synchronization with their TCP counterparts, but the effect is minimal as we will show.)
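To make that bookkeeping concrete, the toy model below treats TISeq and TIAck as fixed offsets from the TCP sequence and acknowledgment numbers captured when the flow binds to a connection, so advancing one space advances the other. The class and field names are hypothetical; the actual implementation lives in the kernel.

```python
class TIFlow:
    """Toy model: TISeq/TIAck stay in lock-step with the TCP sequence space
    via constant offsets recorded at synchronization time."""

    def __init__(self, tiseq, tiack, tcp_seq, tcp_ack):
        self.seq_off = tiseq - tcp_seq    # offset binding TISeq to TCP seq
        self.ack_off = tiack - tcp_ack    # offset binding TIAck to TCP ack

    def tiseq(self, tcp_seq):
        return tcp_seq + self.seq_off

    def tiack(self, tcp_ack):
        return tcp_ack + self.ack_off

# Initial values loosely modeled on Trace 2 (illustrative only).
flow = TIFlow(tiseq=0x1D, tiack=0x1F, tcp_seq=110, tcp_ack=456)

# After 1000 bytes are sent and acknowledged, both spaces advance together.
assert flow.tiseq(110 + 1000) == 0x1D + 1000
assert flow.tiack(456 + 1000) == 0x1F + 1000
```

The per-packet cost is thus a pair of additions, which is consistent with the negligible overhead measured below.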

Overhead During Establishment The instructions added to the kernel increase the time it takes to establish a TCP connection. The amount of additional work is small, so the effect should also be small.

There are two challenges in measuring the time it takes to establish a TCP connection so that a comparison can be made. First, precisely measuring the overhead requires kernel instrumentation, which is tedious to set up and has the potential to perturb the normal operation of the kernel, thereby obscuring the values being measured. Second, the overhead occurs on both the initiator of communication and the responder. Measuring elapsed time in a distributed control system, such as TCP, is further complicated by the fact that the clocks on the end hosts are only loosely synchronized when compared with the magnitude of the values being measured.

The first concern is addressed by measuring times in user-space code under the assumption that (on average) both the extended and legacy TCP stacks should see the same perturbations from unrelated processing on the hosts. This assumption is supported by the low variance seen on repeated measurements.

The second concern is addressed by taking both time stamps on the same host. A simple client and server are used to create a connection. A time stamp is placed in the client code just before the call to connect, whereupon the client immediately blocks on recv. The server immediately closes the connection upon returning from accept, causing the recv on the client to return without reading any data. The client then closes its socket. Except for the processing that occurs on the final FIN packet from the client to the server, acquiring time stamps in these locations brackets all the additional processing that is done on both the client and the server in the isolation boundary implementation. Subtracting the elapsed time for establishing a connection with the extended TCP stack from the elapsed time for the legacy stack gives the overhead. The average overhead was computed over a large enough number of runs that the half width of the 95% confidence interval is below 5%.
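The measurement harness just described can be sketched in user space as follows. This is an illustrative reconstruction, not the code used in the experiments; it runs the server in a thread on the loopback interface and takes both time stamps on the client side.

```python
import socket
import threading
import time

def measure_setup():
    """One sample of the elapsed time bracketing connection establishment:
    stamp before connect, stamp after recv returns on the server's close."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))          # ephemeral port avoids collisions
    srv.listen(1)
    port = srv.getsockname()[1]

    def server():
        conn, _ = srv.accept()
        conn.close()                     # close immediately after accept

    t = threading.Thread(target=server)
    t.start()

    cli = socket.socket()
    t0 = time.monotonic()                # time stamp just before connect
    cli.connect(("127.0.0.1", port))
    data = cli.recv(1)                   # returns b"" once the server closes
    t1 = time.monotonic()                # brackets both ends' setup work
    cli.close()
    t.join()
    srv.close()
    assert data == b""                   # connection closed without data
    return t1 - t0

sample = measure_setup()
assert sample > 0.0
```

Running this once against a legacy stack and once against the extended stack, and averaging many samples, yields the comparison reported next.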

The average time between time stamps without the isolation boundary mechanism is 1.168 ± 0.054 msec, while the average with the mechanism is 1.148 ± 0.045 msec. Based upon the overlap of the confidence intervals, we conclude that the increase in overhead for connection establishment is negligible.
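The overlap test used in that conclusion is a simple interval comparison; a minimal sketch (our own helper, using the establishment numbers reported above):

```python
def ci_overlap(mean_a, half_a, mean_b, half_b):
    """True when the intervals [mean - half, mean + half] overlap."""
    lo = max(mean_a - half_a, mean_b - half_b)
    hi = min(mean_a + half_a, mean_b + half_b)
    return lo <= hi

# Establishment times reported above (msec, 95% confidence half-widths):
assert ci_overlap(1.168, 0.054, 1.148, 0.045)   # intervals overlap
# A clearly separated pair, by contrast, would not overlap:
assert not ci_overlap(1.0, 0.1, 2.0, 0.1)
```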

Overhead During Data Transfers For each packet received, the TISeq and TIAck values must be updated to advance at the same rate as the corresponding TCP variable to which they are logically bound. This adds a small amount of code that is executed only when the isolation boundary is in use. We quantify the cost of the additional code by comparing the time it takes to transfer an amount of data using iperf both with and without the isolation boundary. The first set of results is run between two real hosts on a gigabit Ethernet. The average data bandwidth with a generic kernel was 940.3 ± 1.4 Mbps while the bandwidth with the modified kernel was 940.6 ± 0.2 Mbps. In both cases the TCP traffic saturated the network, so it is impossible to tell if the isolation boundary decreased performance of the TCP connection, since TCP processing was not the bottleneck.

In order to explore TCP processing as the bottleneck, we configured iperf to use the loopback interface. To mitigate memory bandwidth as the bottleneck, we reduced the MTU such that TCP chose 2048 bytes as the maximum segment size. This forced a higher packet rate such that per-packet processing became the bottleneck. Under these conditions, the generic kernel achieved a bandwidth of 916.4 ± 1.9 Mbps and the modified kernel achieved a bandwidth of 915.1 ± 2.2 Mbps. All results were computed at the 95% confidence level. While the isolation boundary may incur a small performance cost, we were unable to verify this with statistical significance. In all but extreme cases, TCP processing is not the bottleneck, and users will not perceive any degradation in TCP performance when the isolation boundary is in use.

Time to Reconnect

Although the isolation boundary enables automatic reconnection when a middlebox loses state, reconnection must be fast enough to remain transparent to the user or application. As above, there is the issue of finding a common time base in a distributed system in which to estimate the reconnect time. Unlike measuring the connection overhead, we cannot use the client/server approach to bound the time since the application is unaware that its TCP connection has failed.6

Instead, we use the tcpdump time stamps on the SYN and ACK packets of the three-way handshake, in a trace taken on the originating host, as our sources of time. The difference between the two time stamps is an estimate of the reconnection time. We also measure the setup time for a legacy connection as a baseline.
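The estimate is simply the difference between two time stamps pulled from the trace. A sketch of that extraction (the parser and the trace lines below are illustrative, assuming a simplified tcpdump-like format with a leading time stamp in seconds):

```python
def handshake_duration(trace_lines):
    """Difference between the time stamps of the initial SYN and the final
    ACK of a three-way handshake, from a trace taken on the initiating host.
    Assumed line format: '<seconds> IP ... Flags [S|S.|.] ...'"""
    syn_ts = ack_ts = None
    for line in trace_lines:
        ts, rest = line.split(" ", 1)
        if "Flags [S]" in rest and syn_ts is None:
            syn_ts = float(ts)           # first SYN from this host
        elif "Flags [.]" in rest:
            ack_ts = float(ts)           # bare ACK completing the handshake
    return ack_ts - syn_ts

# Made-up trace for illustration; the delay roughly mirrors the ~96 ms RTT
# case discussed below.
trace = [
    "0.000000 IP 192.168.1.2.4874 > 192.168.2.4.22: Flags [S], seq 100",
    "0.095800 IP 192.168.2.4.22 > 192.168.1.2.4874: Flags [S.], seq 200, ack 101",
    "0.096260 IP 192.168.1.2.4874 > 192.168.2.4.22: Flags [.], ack 1",
]
assert abs(handshake_duration(trace) - 0.096260) < 1e-9
```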

Figure 4.12: Time for client to reconnect vs. round-trip time.

As expected, because reconnection causes a packet exchange, the time for a client to reconnect is a function of the round-trip time (RTT), as evidenced in Figure 4.12. It takes about one RTT for the initiator of the reconnection to reestablish the connection and one and a half RTTs for the receiver to do the same. Because we are measuring time from the perspective of the sender, the reconnect time should be less than the 1.5 × RTT needed to establish a TCP connection on the receiver. Except for the highest RTT measured, the reconnection time of the extended TCP stack is indistinguishable from the connection setup time of the legacy stack. For an RTT of 95.75 ± 0.11 ms, as shown in the inset, the reconnection time is 96.26 ± 0.13 ms while the connect time is 95.80 ± 0.16 msec. The difference is significant at the 95% confidence level. For most applications, a difference of 0.46 ms should be acceptable. We have yet to optimize the code and expect that it can be further reduced.

6We simulate failure by accessing a custom sysctl variable that calls a kernel function to disconnect the "failed" TCP connection and re-synchronize the TI-flow with a new TCP connection.

4.5.4 Alternate Methods to Manage Communications Involving Middleboxes

Although middleboxes violate the end-to-end principle, they are accepted for the features they provide. Consequently, we see a conscious effort by the networking community to facilitate deployment of middleboxes in a manner that retains their benefits while minimizing their drawbacks [151, 152].

At present, two approaches exist to mitigate the challenges introduced by middleboxes: (1) explicit control of the middleboxes (e.g., middlebox communication (MIDCOM) [42, 44] and IETF Next Steps in Signaling (NSIS) [61]); and (2) traversing the middleboxes (i.e., without any control relationship between the end host and the middlebox, as in the case of IETF Session Traversal Utilities for NAT (STUN) [64], Traversal Using Relays around NAT (TURN) [153], and Interactive Connectivity Establishment (ICE) [154]).

We do not argue that the methods developed to maintain end-to-end semantics without precluding middleboxes are better or lacking. Instead, we present the case of the Isolation Boundary, which enables end hosts to establish an abstract concept of application streams independent of the underlying transport. Such a mechanism allows us to accept interactions with middleboxes all the while strengthening the end-to-end nature of the communication. Below we discuss select research towards maintaining end-to-end semantics while accepting middleboxes.

Network researchers have been studying the interaction of end hosts with middleboxes [42, 44]. The authors argue that middleboxes should be application agnostic — i.e., they should not be required to maintain application intelligence in order to provide their full assistance. For this reason, they propose an architecture and a framework that allow trusted entities — referred to as MIDCOM agents — to assist middleboxes in meeting their objectives without incorporating application intelligence in the middleboxes. The MIDCOM agents may reside on end hosts, proxies, or application gateways, depending upon the circumstances. In contrast to MIDCOM, the isolation boundary establishes a higher-level concept — a transport-independent flow — which allows us to maintain end-to-end semantics despite the presence of middleboxes.

Snoeren et al. [150] propose TCP Migrate, which maintains end-to-end semantics across IP address changes for mobile clients. Here, an abstract token is used to identify the stream, independent of the network attachment point. However, TCP's sequence space is used to describe the stream. This dependence on TCP's sequence space to define the stream not only creates strong coupling with TCP, but also introduces the possibility of hijacking the communication (which has been acknowledged by the authors). Our abstract sequence-space definition does not make the proposal any more vulnerable to hijacking than TCP is today. Defining an abstract sequence space with the isolation boundary allows us to describe the communication independent of the underlying transport. This allows us to map the application stream to one or more transport streams, which is not possible with TCP Migrate.

Sultan et al. [155] propose M-TCP, a connection-migration solution that deals with intermittent connectivity due to failures at the server end. The proposal is to deploy clones of services at different locations in the network, which can exchange communication state as need be. The client initiates a connection-migration request from one service instance to the other; the protocol stacks then cooperatively exchange communication state while maintaining end-to-end semantics. The isolation boundary, however, is not limited in scope as M-TCP is; either the server or the client can trigger the reconnection (of a disconnected TCP connection, on the same node, potentially with a different IP address). Although the current implementation of the isolation boundary does not migrate state to different processes, the isolation boundary does not preclude such mechanisms. In fact, we argue that the isolation boundary enables such "high layer services" without compromising backward compatibility.

4.5.5 Summary of Case Study

Here, we have laid out further arguments for establishing an isolation boundary in TCP that allows us to restore end-to-end resilience in the presence of middleboxes while maintaining full backward compatibility with legacy TCP. Our realization of the isolation boundary introduces little overhead, and in most cases, it is difficult to observe the performance difference.

4.6 Summary

Here, we have laid out an argument for establishing an Isolation Boundary for TCP that maintains backwards compatibility.

Note that the specification of the control channel protocol will be pursued as part of the post-preliminary proposed work, as presented in Chapter 5. We feel that a compliant stack that implements the Isolation Boundary must admit the possibility of a control channel and properly negotiate a data-only channel, in addition to implementing the control channel itself. We have claimed that, given an Isolation Boundary, protocol designers will be able to construct higher-level functionality on top of TCP. As a proof of this claim, we will at least need to create a mock implementation of the control channel and construct a higher-level functionality that puts the features to effective use.

Chapter 5

Enabling New Communications Paradigms

In Chapter 3, we presented a session-based communication model that can be used to describe modern communications. We explained how these abstractions may be used to describe different communication patterns. In Chapter 4, we explained how a framework for the proposed model may be implemented in a backwards-compatible manner, such that we build upon the legacy of TCP and the existing Internet. We also presented extensions, which substantiate our claims that incremental evolution of TCP is possible and effective.

The session, flow, and end-point abstractions allow us to define the constructs that together form a conversation. Their associated primitives allow us to set up and manage these constructs. The control channel used by the session layer provides a (practically infinite) space to exchange control messages between the communicating stacks. Together, the abstractions and the control channel form the means to enable configuration (setup of communication) and reconfiguration (adaptation based on the context of communication).

In this chapter, we propose that a control-signaling space opens the door to a multitude of extensions for existing communications. Doing so also enables communication paradigms that were not possible before.

Here, we discuss the case of considering middleboxes as first-class citizens in the network. This topic has been the subject of considerable debate, with widely varying opinions. For instance, one view states that the benefits of having middleboxes as first-class citizens far outweigh their drawbacks [63]. Conversely, another opinion states that we should not be using middleboxes at all and that, instead, their functionality can be implemented as part of the end-host stack [156], while keeping the network simple.

Since the view of eliminating middleboxes seems to be giving way to the pragmatics of deploying middleboxes for specific functionality in real networks, we propose the use of SLIM's control channel to engage middleboxes and enable richer and more robust communications. Below we explain why, and describe how explicitly engaging middleboxes to set up and manage conversations enables modern use cases.

5.1 Middleboxes Inferring Application State versus Being First-Class Citizens

Middleboxes are now ubiquitous and essential elements in the network infrastructure. They not only act as well-known services, such as firewalls and network address translators (NATs), but also as entities that optimize applications in unique scenarios. For example, the Juniper Networks WX Series Application Acceleration [157] middleboxes optimize network resource use over WANs, and the Cisco Catalyst 6500 Series SSL Services Module [158] assists by offloading computationally intensive tasks of the SSL protocol to the middlebox.

Typically, we see that middleboxes require an understanding of the application semantics and that this application intelligence is strongly coupled with the middlebox implementation — e.g., for application acceleration, the middlebox needs to understand the application logic and protocols to determine what, how, and when to optimize communication; similarly, with SSL offloading, the middlebox needs to understand the protocol, maintain state, and manage the necessary keys.

Requiring middleboxes to be transparent to the end-to-end communication forces them to infer (sometimes poorly) details about state from ongoing communication. Instead, making middleboxes first-class citizens in the network enables endpoints to explicitly share their intent and engage the middleboxes in setting up communications.

5.1.1 Examples of Interaction with Middleboxes

Consider, for example, a firewall. In a typical network, the firewall opens as many ports as there are services behind it — if a web server and an SSH server behind the firewall need to be accessible, the firewall would typically need to open ports 80 and 22. If a new service is added to the network, the firewall would need to open yet another port to make it accessible.1 In these cases, the firewall must be configured to accommodate the behavior of all the entities it is servicing.

As we mentioned earlier, the above behavior occurs because the endpoints are oblivious to the presence of a firewall. If, however, the endpoints were to engage with the firewall directly to request access, the firewall would have no open ports until access is requested (and permission granted by policy). Also, with the traditional end-to-end model, the firewall would be expected to understand the application behavior and set up its configuration accordingly. With explicit interaction, the endpoints may declare their requirements and the firewall may grant access accordingly, relieving the firewall from the burden of having to understand application logic.

1For a deployment with many services, the firewall eventually turns into swiss cheese.

Explicit interaction with middleboxes also helps with cases where the middleboxes act as proxies for endpoint services. Consider, for example, the middleboxes that serve as SSL offload engines for web services. These offload engines act as proxies for the web services, as they terminate SSL connections from the clients and set up unencrypted connections to the web server. Although they relieve the web server from the computationally intensive tasks of maintaining encrypted communication, for large-scale deployments these offload engines may become a single point of failure. If the clients were able to engage the middleboxes, and conversations could be described in the form of a session with more than two participants (as we have shown in Chapter 3), the offload engines could authenticate a session using public-key cryptography. Later, the web service and the client would set up direct communication, encrypted with symmetric keys.

5.2 Explicit Interaction with Middleboxes

It is evident from the discussion above that there needs to be a protocol for the interaction of endpoints with middleboxes. A protocol that enables explicit interaction with middleboxes must be general enough to service a reasonable variety of use cases. However, we appreciate that there cannot be a "one size fits all" protocol that meets the needs of all existing middleboxes and the technologies that are yet to come.

In order to understand the nature of interactions with middleboxes, we focused our study on firewalls and network address translators (NATs). We recognize that these two do not reflect the entire spectrum of middleboxes (e.g., load balancers, captive portals, application accelerators) and therefore acknowledge that an attempt to create a single protocol that services all interactions with middleboxes would result in an incomplete proposal.

We concluded that we can define a generic workflow of interactions with middleboxes, classify interactions with middleboxes based on the types of communications, and define a generic template for the messages exchanged. However, the verbs (i.e., control messages, as we define in Chapter 3) and their semantics cannot be generic, in that they will be specific to the service that the middlebox provides. For example, the verbs and their semantics meant for a firewall service would not encompass the vocabulary needed for an application accelerator.

5.2.1 Generic Interactions with Middleboxes

As we have indicated above, explicit interaction with middleboxes will primarily facilitate communications setup or reconfiguration. We also explained in Chapter 3 that SLIM is primarily involved with session management, which encompasses session setup and reconfiguration. Thus, it is evident that the workflow of communications setup would include interactions with middleboxes during setup or reconfiguration and that SLIM would use its control channel to exchange verbs during these phases.

Figure 5.1 illustrates, in generic terms, the entities hosted by a middlebox in relation to explicit interactions with endpoints. The application data streams flow through the middlebox service elements during conversations, whereas during communications setup, SLIM verbs may be used by the endpoint stack to facilitate configuration through the middlebox. SLIM uses the middlebox-service interface to enact configurations triggered by verbs. In this context, an access-control manager and a policy engine are elements that represent access control (i.e., authentication and authorization) and policy enforcement, respectively. In the scope of our work, we assume the presence of elements that define and enforce policy and access control, as well as an interface for engaging middlebox services.


Figure 5.1: Entities hosted by middleboxes in relation to explicit interactions with endpoints. An ex- ample of SLIM’s context manager engaging the firewall service is highlighted by the dotted arrow. (The middlebox may host one or more middlebox services.)

5.2.2 Key Insight

Our key insight about generic interactions with middleboxes is that these communications must be in the context of the desired communication goals and not be specific to particular middlebox services.

To illustrate, consider the example of a firewall. When an endpoint interacts with a middlebox, instead of requesting that a specific port be opened, the endpoint indicates to the middlebox that it wants to communicate with a particular service behind the firewall. The decision of "what needs to be done to set up communications" is thus left to the firewall, instead of being made by the endpoint requesting communications. This is important because the actions that need to be taken depend not only on the middlebox service, but also on the communication endpoints behind the middlebox.

Using such an approach also enables a smoother path for evolution; for example, requests from endpoints may initially be serviced by a simple firewall policy and later realized with a different implementation using a complex access and authorization policy, all without requiring the endpoints to change.

5.2.3 Typical Workflow

A typical (happy path) workflow of interactions with middleboxes would include the endpoint:

1. Setting up a control channel with the middlebox,

2. Dispatching a verb to the SLIM context manager,

3. The context manager enacting the verb by interfacing with the middlebox service, and

4. The context manager responding to the verb if need be.

We assume that the elements enforcing policy and access control shepherd the interactions within the middlebox.
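The happy-path workflow above can be sketched as a small dispatch loop. Everything in the sketch is hypothetical (class names, the REQUEST_ACCESS verb, and the reply fields are ours); a real deployment would sit behind the policy and access-control elements just mentioned.

```python
# Minimal sketch of the happy-path workflow: a verb arrives over the control
# channel, the context manager checks policy, enacts the verb through the
# middlebox-service interface, and replies.

class MiddleboxService:
    """Stand-in for a concrete middlebox service (e.g., a firewall)."""
    def __init__(self):
        self.open_flows = set()

    def allow(self, session):
        self.open_flows.add(session)
        return {"status": "granted", "session": session}

class ContextManager:
    """Steps 3-4: enacts a verb against the middlebox-service interface."""
    def __init__(self, service, policy=lambda verb: True):
        self.service = service
        self.policy = policy                  # policy/access-control shepherd

    def dispatch(self, verb):
        if not self.policy(verb):
            return {"status": "denied"}
        if verb["VERB"] == "REQUEST_ACCESS":  # hypothetical verb name
            return self.service.allow(verb["SESSION_LABEL"])
        return {"status": "unknown-verb"}

fw = MiddleboxService()
cm = ContextManager(fw)
reply = cm.dispatch({"VERB": "REQUEST_ACCESS", "SESSION_LABEL": "sess-42"})
assert reply["status"] == "granted"
assert "sess-42" in fw.open_flows
```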

5.2.4 Classification of Messages

In Figure 5.2, we summarize the classification of generic types of verbs exchanged with middleboxes.


Figure 5.2: Classification of interactions between endpoints and middleboxes

We classify the interactions with middleboxes as either acknowledged or unacknowledged. Acknowledged interactions include information requests and configuration-change requests. It is evident that information requests from an endpoint result in a response from the middlebox. In the case of configuration changes, the responses may include error codes or configuration state. On the other hand, unacknowledged interactions encompass notifications exchanged between stacks.
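This classification maps directly onto a small type, sketched below with hypothetical names, where the category determines whether a sender should wait for a reply.

```python
from enum import Enum

class Interaction(Enum):
    """Generic interaction types exchanged with middleboxes (Figure 5.2)."""
    INFORMATION_REQUEST = "acknowledged"
    CONFIG_CHANGE_REQUEST = "acknowledged"
    NOTIFICATION = "unacknowledged"

def expects_reply(kind):
    """Acknowledged interactions carry a response (data or an error code)."""
    return kind.value == "acknowledged"

assert expects_reply(Interaction.INFORMATION_REQUEST)
assert expects_reply(Interaction.CONFIG_CHANGE_REQUEST)
assert not expects_reply(Interaction.NOTIFICATION)
```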

5.2.5 Verb Templates

A generic verb template is illustrated in Listing 5.1 and includes seven required records. These records are described below.

VERB: A unique name of the verb is used to establish the semantics of the message and subsequent expectations. The names are to be chosen by the teams defining (and potentially implementing) the semantics.

SOURCE: The source label identifies the endpoint that dispatches the verb.

TRANSACTION_ID: The transaction ID enables ordering of verbs exchanged between endpoints and helps with resolving conflicts in a distributed setting where messages may appear as duplicates, even though they are not.

TIMEOUT: The timeout serves as an expiration timer where a message may face delays in being serviced and may not be relevant beyond a certain time. It also serves as an indication of the message's time to live.

SESSION_LABEL: The session identifier uniquely identifies the conversation to which the verb belongs.

AUTH_TOKEN: The authentication and authorization token serves as an attribute of the source endpoint requesting the verb to be serviced. Once authenticated and authorized, the endpoint uses the token for access and avoids repetition of access-control procedures.

{
  "VERB" : "name",
  "SOURCE" : "endpoint_identifier",
  "TRANSACTION_ID" : "identifier_enabling_ordering",
  "TIMEOUT" : "expiration_timer",
  "SESSION_LABEL" : "session_identifier",
  "AUTH_TOKEN" : "cryptographic_hash_enabling_access_control",
  "PAYLOAD" : {
    "list" : [{
      "VERB_SPECIFIC_CLAUSE_1" : "payload_1",
      "VERB_SPECIFIC_CLAUSE_2" : "payload_2",
      ...,
      "VERB_SPECIFIC_CLAUSE_N" : "payload_n"
    }]
  }
}

Listing 5.1: The generic template of a verb, represented in JSON format. As we explain later, we chose to implement verbs in JSON format to facilitate debugging and troubleshooting of interactions.

We do not mandate the implementation of the token, as a variety of approaches can be adopted. For example, to ensure message integrity, an implementation may include the contents of the verb when creating the authentication token, while other implementations may include a signed cryptographic hash of the verb contents as part of the payload.

PAYLOAD: The payload includes verb-specific clauses that enable the servicing of the request. For example, a verb exchanged with a captive-portal middlebox may request to be placed in a particular tier for network access.
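A verb following this template can be built and checked with a few lines; the sketch below mirrors the seven required records of Listing 5.1, with all field values (endpoint name, token, clauses) purely illustrative.

```python
import json

# The seven required records of the generic verb template (Listing 5.1).
REQUIRED = ["VERB", "SOURCE", "TRANSACTION_ID", "TIMEOUT",
            "SESSION_LABEL", "AUTH_TOKEN", "PAYLOAD"]

def make_verb(name, source, txn, timeout, session, token, clauses):
    """Build a verb dict following the generic template."""
    return {
        "VERB": name,
        "SOURCE": source,
        "TRANSACTION_ID": txn,
        "TIMEOUT": timeout,
        "SESSION_LABEL": session,
        "AUTH_TOKEN": token,
        "PAYLOAD": {"list": [clauses]},
    }

def is_well_formed(verb):
    """True when all seven required records are present."""
    return all(key in verb for key in REQUIRED)

# Illustrative values only; a real token would be a cryptographic hash.
v = make_verb("REQUEST_ACCESS", "client-1", 7, 30, "sess-42",
              "deadbeef", {"SERVICE": "web"})
assert is_well_formed(v)
assert not is_well_formed({"VERB": "REQUEST_ACCESS"})
assert json.loads(json.dumps(v)) == v        # round-trips cleanly as JSON
```

The JSON round-trip at the end reflects the design choice noted in the listing caption: verbs stay human-readable for debugging and troubleshooting.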

5.2.6 Towards Incremental Adoption

Explicit interaction with middleboxes need not be mandated for communications setup; legacy communications may continue to operate while assuming transparency of middleboxes. Our approach is towards optional interactions with middleboxes, such that they enable richer and more robust communications. Doing so allows us to accommodate incremental adoption all the while allowing conventional communications to continue.

5.3 Interactions with Firewalls

A generic discussion about explicit interactions with middleboxes gives us a holistic view of what to expect when engaging middleboxes. However, as we explain in Section 5.2, it is not possible to devise a one-size-fits-all protocol that covers all possible interactions with a variety of middleboxes. Therefore, we conclude that a suitable way to approach this challenge is to devise a generic protocol that defines generic interactions with middleboxes. Subsequently, we may use this generic protocol to develop solutions specific to a class of middleboxes. Thus, here we discuss the case of firewalls and the possible interactions that they may have with endpoints.

5.3.1 Design Considerations

Below are the design considerations that were taken into account while conducting this research:

Interface

There are a variety of firewall implementations in existence today, and each approach has its own strengths and weaknesses. However, our concern in this research is not the implementation of firewalls, but rather interactions with firewalls while treating them as black boxes. It is for this reason that we chose the Firewalld interface [159], which is defined as part of the Fedora Project.

Dynamism

Explicit interaction with middleboxes suggests that firewalls should be able to change their configurations on the fly to accommodate access requests, and do so in a way that does not disrupt the existing state of communications.

Some firewall implementations are susceptible to losing communication state upon changing the firewall configuration. This is because their design assumes that firewall configurations remain largely static and that a change in configuration may safely be followed by a service restart. However, contemporary implementations allow firewall configurations to change without requiring a service restart. This allows for configuration changes without disrupting ongoing communications that meet the access-control criteria.

Concepts

When describing an endpoint's communication goals, the following concepts may be used to describe the intended service, its configuration, and potential means of access. The agreed-upon vocabulary also allows us to describe the actions that the firewall may take to act upon the request. We have partially adopted this vocabulary from Firewalld [159]. These concepts are partially illustrated in Figure 5.3.

Service: A service is defined as an endpoint providing a facility to interested clients. The

service sits behind the firewall and its description includes its configuration details. For

example, an endpoint representing a web application and accessible through port 8080

would be classified as a service.

Zone: A zone is a collection of configurations typically applied to a network attachment point (i.e., a network interface). The configuration of a zone defines how a firewall will treat traffic. Generally, network interfaces are described as being a member of a particular zone at a point in time. The configuration also describes the services that are accessible in the zone.

Source: In contrast with a service, a source defines the client endpoint that intends to communicate with the service.

Figure 5.3: A partial representation of firewall-related concepts, showing a service behind the firewall, the Internet-facing side, and the public, DMZ, and trusted zones. (We state partial because a source may also be listed as a network interface and not just an endpoint. Doing so encompasses all traffic flowing from the said interface.)
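The relationships among these concepts can be sketched as simple data types. The following Python sketch is illustrative only; the class and field names are ours and are not part of Firewalld's API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Service:
    """An endpoint behind the firewall offering a facility to clients."""
    label: str       # e.g., "HTTP"
    interface: str   # e.g., "10.10.1.54"
    port: int        # e.g., 8080

@dataclass(frozen=True)
class Source:
    """A client endpoint (or interface) that intends to reach a service."""
    address: str     # e.g., "192.168.0.4"

@dataclass
class Zone:
    """A collection of configurations applied to a network attachment point."""
    name: str                                    # e.g., "public", "dmz", "trusted"
    services: set = field(default_factory=set)   # services accessible in this zone
    sources: set = field(default_factory=set)    # sources admitted into this zone

    def admits(self, source: Source, service: Service) -> bool:
        # Traffic is allowed only if both the source and the service
        # are members of this zone.
        return source in self.sources and service in self.services
```

A firewall treats traffic by checking zone membership: adding a service to a zone exposes it there, and adding a source admits the client's traffic.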

5.3.2 Typical Workflow of Explicit Interactions with Firewalls

As we explain in Section 5.2.2, it is our conclusion that a suitable design decision for interactions with middleboxes would involve describing requests in the context of communication goals. This is in sharp contrast to defining specific requests to the middlebox. In other words, it is a better design decision for an endpoint to communicate to the middlebox that it needs to converse with a particular service behind the firewall, instead of the endpoint determining the configuration of the service and then requesting the firewall to open ports accordingly.

This is significant because each deployment of a middlebox, an instance of a service, and the associated networking configurations may require unique configuration changes to enable access. Thus, enabling access may not be as simple as opening a specific port to allow traffic through. Also, a cascade of changes may be required to enable access when a configuration isn't already in place.

In this context, to illustrate a typical workflow of interactions with middleboxes, consider Figure 5.4. Assume that a web service is to be set up behind the firewall. To allow potential clients to access the web service, the service would have to register with the firewall. This process may involve a service label, using methods such as [160, 161] (or registering a port for the service, which will be known by legacy applications). The firewall may then provision access and resources, in accordance with the policy implemented by the middlebox. Authentication and authorization checks for the service endpoint would be performed at this stage.

Figure 5.4: An example of explicit interaction between endpoints (Alice and Bob) and a middlebox (the firewall). The steps are: 1 - register service, 2 - provision, 3 - request service, 4 - allocate, 5 - access service, 6 - conclude access, 7 - deallocate, 8 - revoke service, 9 - tear down.

Should a client or endpoint request access from the firewall to the service using the service label (or port), the firewall may instantiate configurations to allow access. The firewall may also apply access policies at this stage. As with the service endpoint, authentication and authorization checks may be applied at this stage. Also, the provisioning of resources may be done indefinitely or with expiration limits, as defined by policy.

Upon conclusion of the conversation between the endpoint and the service, the client may notify the firewall, after which the firewall reclaims the allocated resources. Resources may also be reclaimed if the reservations expire without being renewed.
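The expiry-based reclamation described above amounts to a lease table. The following Python sketch is a hypothetical illustration of that bookkeeping; SLIM does not prescribe this structure:

```python
import time

class LeaseTable:
    """Tracks resource reservations with optional expiration times."""

    def __init__(self):
        self._leases = {}  # client id -> expiry timestamp (None = indefinite)

    def allocate(self, client, ttl=None):
        # ttl is in seconds; None means the reservation never expires.
        self._leases[client] = None if ttl is None else time.monotonic() + ttl

    def renew(self, client, ttl):
        self._leases[client] = time.monotonic() + ttl

    def conclude(self, client):
        # Explicit notification from the client: reclaim immediately.
        self._leases.pop(client, None)

    def reap_expired(self):
        # Reclaim reservations whose leases lapsed without being renewed.
        now = time.monotonic()
        expired = [c for c, t in self._leases.items() if t is not None and t <= now]
        for c in expired:
            del self._leases[c]
        return expired

    def active(self, client):
        return client in self._leases
```

A firewall would call `reap_expired` periodically (or on each request) so that abandoned reservations do not accumulate.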

Should a service want to stop operations, it may revoke its registration with the firewall, upon which the firewall would deprovision access to that service.

5.3.3 Protocol and Semantics

To explain the protocol between endpoints and middleboxes, we classify the actions (taken as part of the workflow) into two groups: the verbs exchanged between the endpoints and the middleboxes, and the subsequent steps taken in order to implement the requests.

Verbs Exchanged between Endpoints and Middleboxes

The verbs exchanged are of two types: those that originate from the service behind the firewall and those that originate from the client requesting access to the service through the firewall.

The verbs that are initiated from the service are the register_service and revoke_service verbs. Examples of these are listed in Listings 5.2 and 5.3. In addition to including the mandatory fields that are required by all verbs, the payload includes information that the firewall will require to implement rules for maintaining access to the service. These include service labels to uniquely identify the service, endpoint labels to determine the communication endpoint if SLIM were used for communications, or interface address (IP) and port numbers if legacy TCP was being used².

  {
    "VERB": "register_service",
    "SOURCE": "bob.config",
    "TRANSACTION_ID": "1873",
    "TIMEOUT": "40",
    "SESSION_LABEL": "config.org",
    "AUTH_TOKEN": "fb59188239b623b5a2314084b4ac2d0c",
    "PAYLOAD": {
      "list": [{
        "SERVICE_LABEL": "HTTP",
        "ENDPOINT_LABEL": "bob.meetup",
        "CURRENT_INTERFACE": "10.10.1.54",
        "SERVICE_PORT": "9786",
        "POLICY_CLASSIFIER": "internal_traffic_only"
      }]
    }
  }

Listing 5.2: The register_service verb.

The policy classifier determines how the request is to be serviced.

As we mentioned earlier, we assume the existence of a policy engine (which falls outside the scope of our current work). Note that the authentication token for the service endpoint is used as a delegate to authenticate and authorize the request. In the case of the revoke request, the policy classifier determines how the request is to be implemented: lazily in this case, to allow ongoing communications to gracefully terminate. Also, as discussed earlier, these verbs fall under acknowledged communication and are followed by a response with error codes explaining whether the requests were successful or not.

Note that all requests by the verbs are in terms of communication goals. They do not direct the firewall as to how they are to be implemented, or what rules need to be included in the configuration.
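The mandatory fields common to all verbs can be captured by a small message constructor. The following Python sketch is our illustration of the envelope shown in the listings; the helper names and default timeout are assumptions, not part of SLIM's specification:

```python
import json

# Fields that every verb message carries, per the listings in this section.
MANDATORY_FIELDS = ("VERB", "SOURCE", "TRANSACTION_ID", "TIMEOUT",
                    "SESSION_LABEL", "AUTH_TOKEN", "PAYLOAD")

def make_verb(verb, source, transaction_id, session_label, auth_token,
              payload_items, timeout="40"):
    """Build a verb message expressed in terms of communication goals."""
    return {
        "VERB": verb,
        "SOURCE": source,
        "TRANSACTION_ID": str(transaction_id),
        "TIMEOUT": timeout,
        "SESSION_LABEL": session_label,
        "AUTH_TOKEN": auth_token,
        "PAYLOAD": {"list": list(payload_items)},
    }

def validate_verb(msg):
    """Reject messages that are missing any mandatory field."""
    missing = [f for f in MANDATORY_FIELDS if f not in msg]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    return msg
```

Because the payload is a plain dictionary, it serializes directly to the JSON form shown in Listings 5.2 through 5.6.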

²Note that the configuration does not preclude the scenario where the firewall has to be deployed to allow legacy traffic. In such a case, the service may continue to use SLIM, while the client may be using legacy TCP; such a scenario would work since SLIM's implementation is backwards-compatible with legacy TCP.

  {
    "VERB": "revoke_service",
    "SOURCE": "bob.config",
    "TRANSACTION_ID": "1874",
    "TIMEOUT": "40",
    "SESSION_LABEL": "config.org",
    "AUTH_TOKEN": "fb59188239b623b5a2314084b4ac2d0c",
    "PAYLOAD": {
      "list": [{
        "SERVICE_LABEL": "HTTP",
        "ENDPOINT_LABEL": "bob.meetup",
        "CURRENT_INTERFACE": "10.10.1.54",
        "SERVICE_PORT": "9786",
        "POLICY_CLASSIFIER": "lazy_enforcement"
      }]
    }
  }

Listing 5.3: The revoke_service verb.

The verbs that are initiated by the endpoint interested in engaging the service are request_service and conclude_access. As with the register and revoke service verbs, the request_service verb is followed by a response from the firewall. In addition to the error codes, the reply in this case includes location information enabling the client endpoint to invite the service endpoint to join the session. The conclude_access verb is a notification sent from the client endpoint to the firewall. These verbs are illustrated in Listings 5.4, 5.5, and 5.6.

We acknowledge that other verbs may be included in the protocol to enrich the functionality.

These may include verbs initiated by the service endpoints requesting configuration updates.

However, for now, we limit ourselves to the minimum set of verbs required to demonstrate the use case of endpoint interactions with firewalls.

  {
    "VERB": "request_service",
    "SOURCE": "alice.meetup",
    "TRANSACTION_ID": "1894",
    "TIMEOUT": "40",
    "SESSION_LABEL": "meetup.org",
    "AUTH_TOKEN": "b5936100164692ef94e3f52253d73be7",
    "PAYLOAD": {
      "list": [{
        "TYPE": "REQUEST",
        "SERVICE_LABEL": "HTTP",
        "SERVICE_ENDPOINT_LABEL": "bob.meetup"
      }]
    }
  }

Listing 5.4: The request_service verb.

  {
    "VERB": "request_service",
    "SOURCE": "firewall.meetup",
    "TRANSACTION_ID": "1895",
    "TIMEOUT": "40",
    "SESSION_LABEL": "meetup.org",
    "AUTH_TOKEN": "2c1743a391305fbf367df8e4f069f9f9",
    "PAYLOAD": {
      "list": [{
        "TYPE": "REPLY",
        "SERVICE_LABEL": "HTTP",
        "SERVICE_ENDPOINT_LABEL": "bob.meetup",
        "CURRENT_INTERFACE": "10.10.1.54",
        "SERVICE_PORT": "9786"
      }]
    }
  }

Listing 5.5: The reply to the request_service verb.

  {
    "VERB": "conclude_access",
    "SOURCE": "alice.meetup",
    "TRANSACTION_ID": "1896",
    "TIMEOUT": "40",
    "SESSION_LABEL": "meetup.org",
    "AUTH_TOKEN": "b5936100164692ef94e3f52253d73be7",
    "PAYLOAD": {
      "list": [{
        "TYPE": "REQUEST",
        "SERVICE_LABEL": "HTTP",
        "SERVICE_ENDPOINT_LABEL": "bob.meetup"
      }]
    }
  }

Listing 5.6: The conclude_access verb.

Actions Taken by the Middleboxes in Response to Verbs

In response to the requests made by endpoints, the middlebox performs a variety of actions.

These include validating requests as well as performing authentication and authorization checks.

As explained earlier, it is left up to the firewall to determine how to implement configuration updates for service deployment. To illustrate the actions, we take the simple example of a configuration change that allows traffic coming from a client's location through to a particular port on a specific interface and nothing more. In a complex example, a service may require several configuration changes, such as allowing traffic from multiple sources to multiple endpoints (e.g., web application, media server) behind the firewall.

In terms of the workflow in Figure 5.4, the provisioning and tear down of resources includes the definition of firewall configurations. These include definitions of services, zones, or sources. On the other hand, when allocating and deallocating resources, these configurations are applied. For example, services are added to zones to allow traffic to the service, and sources are added to zones to allow traffic from the client through to the service. Further details are discussed in Section 5.3.5.

5.3.4 State Machine

The state of the firewall in response to verbs may be described by a simple state transition diagram, which is shown in Figure 5.5. The state of the firewall is represented in terms of access to the service. Once the service is registered with the firewall, it is classified as being provisioned. When a client requests access to the service, the firewall instantiates configurations and enters the active state for that client. The firewall returns to the provisioned state when the client concludes communications with the service. Access to the service is revoked when the service submits a valid request to do so. For the sake of clarity, the state diagram in Figure 5.5 does not include transitions due to errors.

Figure 5.5: Firewall state machine. (A validated service request enters the provisioned state; a validated access request enters the active state; concluding access or an expired reservation returns to provisioned; a validated revoke request leaves either state.)
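The transitions of Figure 5.5 can be sketched directly as a table-driven state machine. The event names below mirror the figure's labels; the initial "unregistered" state and the class itself are our illustrative additions:

```python
# Table-driven sketch of the firewall state machine in Figure 5.5.
TRANSITIONS = {
    ("unregistered", "service_request_validated"): "provisioned",
    ("provisioned",  "access_request_validated"):  "active",
    ("active",       "access_concluded"):          "provisioned",
    ("active",       "reservation_expired"):       "provisioned",
    ("provisioned",  "revoke_request_validated"):  "unregistered",
    ("active",       "revoke_request_validated"):  "unregistered",
}

class FirewallState:
    def __init__(self):
        self.state = "unregistered"  # before any service registers

    def on(self, event):
        try:
            self.state = TRANSITIONS[(self.state, event)]
        except KeyError:
            # Error transitions are omitted from Figure 5.5; reject here.
            raise ValueError(f"invalid event {event!r} in state {self.state!r}")
        return self.state
```

Keeping the transitions in a table makes it straightforward to later add the error transitions that the figure omits.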

5.3.5 Implementation Considerations

Here we present the implementation considerations, continuing with the example discussed in Section 5.3.3.

As we explained in Section 3.3.2, verbs are implemented through handlers that are included with the SLIM library. Therefore, the verbs that we’ve listed above have corresponding handlers implemented that interface with the firewall implementations (see Figure 5.1). In our case, we’ve used the firewalld interface to implement the verbs.

Listing 5.7 summarizes the select and necessary interface calls that are made to the firewall daemon to enact the verb requests. Initialization involves defining default configurations that allow the firewall to fall back to a set of rules when relevant rules are not defined. Following the register service verb, a service (in the context of a firewall) is created with the associated configuration. Provisioning of resources includes assigning the service to a particular zone so as to make it available within that zone. Doing so may also involve additional configurations where specific ports may need to be opened, in addition to the configurations that are defined as part of the service definition. Allocation of resources for a client results in adding the client configuration to the same zone as the service. Defining a separate zone allows us to sandbox traffic from the client to the service and back. The deallocation of resources and the revoking of the service are essentially geared towards reverting these configurations.

Note that the information exchanged as part of the verbs is in relation to communication goals, whereas the configuration changes involve specific rule configurations that are implemented by the firewall. The influence of policy can also be reflected through rules implemented as part of the configuration.

  # initialization
  firewall-cmd --set-default-zone=public

  # register_service
  firewall-cmd --permanent --new-service=our_service
  firewall-cmd --reload

  # provision service
  firewall-cmd --permanent --new-zone=synergy
  firewall-cmd --permanent --zone=synergy --change-interface=eth0
  firewall-cmd --permanent --zone=synergy --add-service=our_service
  firewall-cmd --permanent --zone=synergy --add-port=443/tcp
  firewall-cmd --reload

  # allocate resources
  firewall-cmd --permanent --zone=synergy --add-source=192.168.0.4
  firewall-cmd --reload

  # deallocate resources
  firewall-cmd --permanent --zone=synergy --remove-source=192.168.0.4
  firewall-cmd --reload

  # teardown
  firewall-cmd --permanent --zone=synergy --remove-service=our_service
  firewall-cmd --permanent --zone=synergy --remove-port=443/tcp
  firewall-cmd --permanent --zone=internal --change-interface=eth0
  firewall-cmd --reload

Listing 5.7: List of select and necessary firewalld commands to enact configurations requested by associated verbs.
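A verb handler can map each request onto firewalld invocations such as those in Listing 5.7. The following Python sketch only generates the argument lists rather than executing firewall-cmd; the zone name, the helper, and the CLIENT_ADDRESS payload field are our illustrative assumptions:

```python
def commands_for(verb, zone="synergy"):
    """Map a SLIM verb message to the firewall-cmd invocations it implies."""
    item = verb["PAYLOAD"]["list"][0]
    service = item["SERVICE_LABEL"].lower()
    if verb["VERB"] == "register_service":
        return [
            ["firewall-cmd", "--permanent", f"--new-service={service}"],
            ["firewall-cmd", "--permanent", f"--zone={zone}",
             f"--add-service={service}"],
            ["firewall-cmd", "--reload"],
        ]
    if verb["VERB"] == "request_service":
        # Allocate: admit the requesting client into the service's zone.
        return [
            ["firewall-cmd", "--permanent", f"--zone={zone}",
             f"--add-source={item['CLIENT_ADDRESS']}"],
            ["firewall-cmd", "--reload"],
        ]
    if verb["VERB"] == "conclude_access":
        return [
            ["firewall-cmd", "--permanent", f"--zone={zone}",
             f"--remove-source={item['CLIENT_ADDRESS']}"],
            ["firewall-cmd", "--reload"],
        ]
    if verb["VERB"] == "revoke_service":
        return [
            ["firewall-cmd", "--permanent", f"--zone={zone}",
             f"--remove-service={service}"],
            ["firewall-cmd", "--reload"],
        ]
    raise ValueError(f"unknown verb {verb['VERB']!r}")
```

Separating verb interpretation from command execution keeps the handler testable and lets the policy engine veto a request before any configuration is touched.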

5.3.6 Cascaded Access

The configurations enacted by middleboxes are not limited to the middleboxes that receive the requests of endpoints. Instead, there may be scenarios where a request by an endpoint cascades into requests to other middleboxes to set up communications. An example of such a scenario is discussed at length in Section 3.7.

Essentially, cascaded access is realized when multiple middleboxes that are providing similar or different services are required to set up communications. Consider the example where we have two SLIM-enabled firewalls along the data path. The client engages the first firewall to set up communications. However, to get to the service endpoint, the second firewall needs to be engaged to allow reachability. The implementation considerations are left to developers, who choose the methods for realizing SLIM verbs, because there may be different, equally viable, solutions to the problem. In the case of the example above, the SLIM verbs implemented by the client may iteratively access the middleboxes to set up communications, or the client may engage the first firewall and delegate the responsibility of setting up further communications.

(We see both approaches implemented by DNS solutions.) Each approach has its benefits and drawbacks; therefore, the use case dictates which method should be adopted as a solution.

In essence, cascaded access will be necessary for realizing modern use cases due to the need for various middleboxes in realizing rich communications.
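The two cascade styles described above, iterative and delegated, can be contrasted with a toy model. The mock class and method names below are ours, used only to illustrate the control flow:

```python
class MockFirewall:
    """Stand-in for a SLIM-enabled firewall along the data path."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream  # next middlebox towards the service
        self.granted = []             # services this firewall has admitted

    def request_service(self, service, cascade=False):
        self.granted.append(service)
        if cascade and self.downstream is not None:
            # Delegated style: this firewall engages the next one itself.
            self.downstream.request_service(service, cascade=True)

def iterative_setup(firewalls, service):
    # Iterative style: the client engages each middlebox in turn.
    for fw in firewalls:
        fw.request_service(service)
```

In the delegated style the client only needs to know the first hop, at the cost of trusting it to act on the client's behalf; the iterative style keeps the client in control but requires it to discover every middlebox on the path.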

5.3.7 Backward Compatibility

In terms of practical deployments, there will be scenarios where the middleboxes along the data path do not support SLIM. In such cases, clients (or upstream middleboxes) will fall back to legacy behavior when they try to explicitly engage middlebox services and find that SLIM is not enabled on the downstream middlebox. The fallback to legacy behavior does not mean that communications will not be possible; instead, communications will be set up as is typically done, where middlebox services are configured upon deployment and transparently assist communications, without the upstream or downstream network elements being aware of their existence. In such cases, while communications will happen, the ability to dynamically configure or reconfigure communications will be lost due to SLIM's absence.

Consider the example of a data path where there are two firewalls between the endpoints, one of which supports SLIM while the other does not. In this case, assume that the SLIM-enabled firewall is the first that the client encounters when communicating with the service endpoint. Here, a SLIM-enabled client will engage the SLIM-enabled firewall to dynamically set up communications. However, from there onwards, communications will be established by legacy means, where the firewall closer to the service would have to be pre-configured by system administrators to meet policy requirements. End-to-end communications will take place, although part of the path would be set up through SLIM, while the rest would fall back to legacy deployments.
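The fallback described above amounts to a probe-then-degrade step at each hop. The following minimal sketch is our illustration; the exception and method names are hypothetical, not part of SLIM's interface:

```python
class SlimNotSupported(Exception):
    """Raised when a downstream middlebox does not speak SLIM."""

def engage(middlebox, service):
    """Try explicit SLIM engagement; fall back to legacy behavior."""
    try:
        middlebox.slim_request_service(service)
        return "slim"
    except (SlimNotSupported, AttributeError):
        # Legacy path: rely on whatever was pre-configured at deployment;
        # dynamic reconfiguration is unavailable on this hop.
        return "legacy"

def setup_path(middleboxes, service):
    # Returns the mode negotiated with each middlebox along the data path.
    return [engage(mb, service) for mb in middleboxes]
```

Because the decision is made per hop, a mixed path (as in the two-firewall example above) degrades gracefully: SLIM-capable hops remain dynamically configurable while the rest stay static.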

5.3.8 Policy Enforcement and Access Control

As we see in Figure 5.1, a middlebox may host a policy enforcement engine and access control mechanisms. While research on the subject of policy enforcement and access control falls outside the scope of our study, we see that SLIM's ability to support dynamic configuration (through verbs) creates ample opportunity to implement such mechanisms.

Note that policy enforcement and access control can be enabled in different contexts, for example, when the middleboxes support SLIM and the endpoints do not, or when both support SLIM. In both these cases, the implementation of SLIM verb handlers makes it possible to enact policy and manage access control.

Consider the example where middleboxes support SLIM, while the endpoints do not. In such cases, when enacting policies for routing traffic from a particular research lab over the research-and-education networks (instead of the default ISP), the implementation of SLIM verbs and their handlers can enable dynamic reservation of resources on the middleboxes and lease data paths accordingly.

Similarly, as we show in Figure 5.4 when discussing a typical workflow of interactions, the implementation of authenticate and authorize verbs dictates how access control may be realized.

5.4 SLIM Extensions in Specialized Domains

SLIM's extensions to the network stack are not limited to enterprise computing or networking in general. We establish that they may be successfully applied to specialized domains as well, for example, high-performance computing. The approach of breaking the coupling of session and transport semantics (in implementations) enables extensibility (as discussed above) and supports features such as resilience and fault tolerance.

Here we highlight the case of working towards enabling resilient communications for MPI implementations, particularly those using Open MPI.

5.4.1 Background

Large-scale computational problems (e.g., weather simulations) that use MPI are typically structured by breaking down the job into subproblems. A set of processes is then responsible for solving one or more of the subproblems. When solutions to all subproblems are obtained, the partial results are aggregated to obtain the overall results. With such work distribution, each process's role is important. If results from all worker processes are not available, the program's results will be incomplete or invalid. Therefore, if such a process fails, the entire job is bound to fail — unless there are mechanisms in place to mitigate those failures [162, 163].

Therefore, mitigating the impact of transient or localized faults, which might cascade into system-wide collapse, and recovering from these failures is of significant importance. We define transient failures as fleeting events (e.g., those caused by interconnect congestion) and localized failures as faults confined to a limited set of hardware or software resources (e.g., those caused by node or process failure).

Enabling resilience for large-scale parallel computations is particularly important as we scale up from compute capabilities of petaflops to exaflops [164]. Scaling up to exaflops will inevitably involve increasing the number of compute processes working in parallel, and it is well-established that as the number of cores increases, so does the number of faults [162, 165] — we can imagine the increase in points of failure with the increase in the number of constituent components of an exascale supercomputer.

Researchers have investigated various dimensions of enabling resilience in MPI programs [162, 163]. These efforts include checkpoint and recovery [166, 167], user-level fault mitigation (ULFM) [168–170], process-level redundancy [171], log-based recovery [172, 173], dynamic process management [174], modified MPI semantics [175, 176], algorithm-based fault tolerance (ABFT) [177, 178], use of intercommunicators with master-worker configuration [179], transactional communication [180], and renewed MPI implementations to include resilience [181]. We will discuss some of these proposals in the related work section in § 5.4.7.

Through a session-layer intermediary (SLIM) [69], we are able to enable resilient communications that mitigate and eventually resolve transient or localized faults, which are introduced by faulty network communication. Examples of such faults include single or multiple failures of MPI primitives caused by failing network paths or congested interconnects. We enable resilience by separating session and transport semantics, which are conflated in the implementations. We make a case that it is this conflation that results in strong coupling and thus inhibits adaptations and recovery in the face of transient events.

While our current contributions are geared towards Open MPI's TCP component [182], the same can be applied to the BTL OpenIB component. This proposal of separating session and transport semantics for OpenIB has also garnered interest from vendors [183].

Since SLIM serves as a shim layer that envelopes the underlying interconnects and interfaces with MPI frameworks (e.g., Open MPI components [182]), all user programs would run without any change to their code. We do not propose any changes to the MPI standard either.

5.4.2 Characterization of MPI-Related Faults

To better highlight the scope of our work and contribution here, we characterize the types of faults in relation to MPI.

We define faults as events that result in abnormal operation. We classify faults into two categories: software and hardware faults.

Software faults are induced by bugs in either users’ source code or the MPI implementations.

We do not address software faults here.

On the other hand, hardware faults pertain to the underlying infrastructure. We further classify these as soft or hard faults. We define soft faults as those that do not necessarily indicate imminent failure of the MPI program, although they may cause individual ranks to fail. These faults can be detected and possibly resolved. In contrast, we define hard faults as those that interrupt the execution of the MPI program in a manner that results in failure, and they may not be detected. Hard faults may at times result in system-wide failures (e.g., failure of GPFS).

Here we focus on soft faults induced by network communication, which include transient and localized failures. We aim to resolve transient faults — those that do not cause an MPI rank to fail — without having to trigger user-level failure mitigation (ULFM), checkpoint and restore, or other fault tolerance mechanisms. As mentioned in § 5.4.1, we define transient failures as fleeting events (e.g., those caused by interconnect congestion) and localized failures as faults confined to a limited set of hardware or software resources (e.g., those caused by node or process failure). The association of these faults is summarized in Figure 5.6.

Figure 5.6: Our classification of faults. (Software faults arise from user code or buggy implementations; hardware faults are either soft (fleeting errors, i.e., transient or localized) or hard (imminent failure).)

5.4.3 Approaches Towards MPI Resilience

Fault tolerance for MPI is a well-studied domain [162]. As we briefly highlighted in § 5.4.1, there have been multiple notable contributions along different dimensions of this challenge — e.g., [169, 171–173, 175, 177].

Assumptions of Fault-Tolerant Methods

Although different fault-tolerant mechanisms address challenges along different (and sometimes orthogonal) dimensions, all methods tend to make certain assumptions about the process [184]. The assumptions are as follows: (1) we are able to detect the failure; (2) we have enough state information to be able to recover from the failure; and (3) we are allowed to instantiate recovery mechanisms to mitigate the faults.

Taking these assumptions into account, we use a state machine diagram to model the detection, mitigation and recovery processes of communication faults (see Figure 5.7).

Figure 5.7: State diagram illustrating fault detection, mitigation, and recovery (states: Execution, Failure Detected, Recovery, Terminate).

Types of Fault-Tolerant Methods

Here we briefly describe the types of fault-tolerant methods, in the context of soft faults that are transient.

Figure 5.8: Master and worker configuration between groups of processes using MPI intercommunicators.

Checkpoint and Restore: The foremost and most widely known method of enabling resilience is checkpoint and restore (e.g., [166]). Checkpoint and restore entails saving the state of the program at regular intervals so that the application may be restarted from the last checkpoint, instead of having to start from the very beginning. Most of the ULFM methods involve different forms of checkpoint-and-restore methods.

Checkpoint-and-restore methods are typically considered to be expensive ways to enable resilience, particularly at scale. However, there are scenarios where the most efficient approach is to use checkpoint and restore, since the overheads depend entirely on the type of the application, the frequency of checkpoints, and the ability to restore states from various stages of the computation.

Due to the inherent nature of the approach, checkpoint and restore allows recovery from transient communication faults, whether it is in terms of recovering from a failed primitive, a failed rank, or even a total failure. However, we argue that such an approach is a very heavyweight way to mitigate transient faults. Such faults can be resolved by much more lightweight approaches.
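As a point of comparison, the checkpoint-and-restore pattern can be sketched in a few lines. This is a toy, single-process illustration of the idea; real systems checkpoint distributed state, typically to stable storage:

```python
def run_with_checkpoints(steps, checkpoint_every, fail_at=None):
    """Accumulate a sum over `steps` iterations, checkpointing periodically.

    A simulated transient fault at iteration `fail_at` forces a restart
    from the last checkpoint rather than from the very beginning.
    """
    checkpoint = {"i": 0, "total": 0}
    i, total = 0, 0
    failed = False
    while i < steps:
        if fail_at is not None and i == fail_at and not failed:
            failed = True                                    # inject one fault
            i, total = checkpoint["i"], checkpoint["total"]  # restore state
            continue
        total += i
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = {"i": i, "total": total}            # save state
    return total
```

The recomputation between the last checkpoint and the fault is exactly the overhead the text refers to: the coarser the checkpoint interval, the more work is repeated after every transient fault.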

Methods Supported by Standard MPI Implementations: The primary reason for the misunderstanding that "if a rank fails, then the entire MPI program will fail" stems from the interpretation that all MPI ranks exist within one communicator. The roots of such practices originate from the use of collective operations while solving compute problems. Collective operations communicate within one communicator. To minimize the complexity of the MPI program, the typical practice is to use the default communicator, which is MPI_COMM_WORLD. When a rank within a communicator fails, or a collective operation fails, it may cascade into the failure of the program if not handled through the users' source code.

The use of intercommunicators [179] encourages compartmentalizing ranks into groups and following the manager/worker paradigm, as illustrated in Figure 5.8. Intercommunicators enable communication of ranks between groups. When a rank fails, for whatever reason, the fault is compartmentalized. This compartmentalization allows the user to manage complexity of the source code and handle the error gracefully using suitable methods (e.g., redundant processes [171] or dynamic process management [174]).

We argue that using intercommunicators to mitigate transient communication failures is also heavier weight than it needs to be because it incurs significant overheads.

Modifying the MPI Standard to Revise Semantics: Another rarely used approach is to revise the semantics of MPI primitives in order to enable fault tolerance. Such approaches have been tried in the past (e.g., [175]). However, these approaches can limit application portability: they have the potential of rendering user codes that are written for such implementations incompatible with other MPI implementations that comply with the MPI standard.

MPI Extensions: An alternate approach to revising MPI semantics is to add extensions to MPI (e.g., [184]). Doing so not only maintains compatibility of the implementation with the MPI Standard, but also allows additional functionality that applications may choose to leverage. This includes, for example, defining suitable error codes and corresponding error handlers.

We adopt the approach of adding an MPI extension to enable fault tolerance, since this not only complies with the MPI standard but also avoids the need to recompile MPI programs. We discuss the details further in § 3.5.

5.4.4 SLIM’s Integration with MPI

To begin, we chose to add our SLIM extension to Open MPI primarily because of its modular component architecture (MCA) [182]. The design of Open MPI (frameworks coupled with components, and the manner in which they are interfaced, instantiated, and used) allows us to confine our proposed extensions to select components. As illustrated in Figure 5.9, the focus of our contribution is on interfacing with the Byte Transfer Layer (BTL) framework. Initially, our contribution is geared towards interactions with the TCP component. In the future, we intend to include interfacing with other components, such as OpenIB.

Figure 5.9: Open MPI architecture (recreated from [182]), showing the Open MPI core (OPAL, ORTE, and OMPI layers) and MCA frameworks such as the MPI byte transfer layer (btl), MPI collective operations (coll), process launch and monitoring (plm), IP interfaces (if), distributed filesystem (dfs), MPI one-sided operations (osc), and high-resolution timers (timer). The btl framework is the focus of our contribution.

Open MPI Byte Transfer Layer (BTL)

The BTL framework works alongside the BTL Management Layer (BML), Point-to-Point Messaging Layer (PML), and the MCA frameworks. BTL is geared towards providing a uniform method of data transfer between participants. The data transfer may be over different interconnects. Here, we focus on TCP alone.

Conflation of Session and Transport Semantics by Legacy BTL-TCP

The coupling of session and transport semantics in legacy TCP causes difficulty in implementing fault tolerance for MPI communications. If there is a loss of network path between endpoints or a transport connection faces a timeout, the BTL TCP connection will drop. This will result in a failed MPI primitive, which may then cascade to a failure of the MPI program.

Thus, it is imperative that we consider session semantics, which is the notion of communication between endpoints, independent of the underlying transport or transfer of data across the network.

Enabling Fault Tolerance

Decoupling session and transport semantics allows us to handle transient failures (related to transport implementations) without inducing the failure of an MPI primitive. If a transport connection fails or times out, i.e., it was not cleanly terminated, SLIM recognizes the fault and attempts to set up a new transport connection to serve as a replacement. The flow is then mapped onto the new transport connection, the sequence spaces are synchronized to ensure that no data is lost, and communication resumes. Details of how this mapping of sequence spaces is managed are discussed in our prior work [69]. The correct synchronization of sequence spaces ensures that no data, including data that may have been in flight, is lost.

Such a disruption, followed by a successful reconfiguration, is illustrated in Figure 3.4.

As we will discuss further in § 5.4.6, this decoupling incurs no overhead during fault-free operation, since the indirection comes into play only during connection setup or transport recovery. During communication, latency and throughput are negligibly impacted, as SLIM essentially acts as a pass-through for the application payload.

Transient faults vs. rank failures It is important to note that SLIM enables resilience only in the face of transient failures. This is in contrast with the scenario where a rank fails.

When a rank fails, program state is likely to be lost; therefore, resuming connectivity with a replacement process may not suffice. (There may be rare cases where the nature of the application allows such possibilities.) To resume communication following a rank failure, a mechanism for automatic recovery must be in place, e.g., application-directed recovery, or message logging and replay.

SLIM makes three attempts to recover from a transient failure. If all three attempts to resume connectivity fail, the error is escalated and may result in the MPI program’s failure, unless alternate mechanisms are in place.

SLIM’s Integration with BTL

Figure 5.10 summarizes our incremental integration and deployment approach. Initially, we have implemented SLIM as a user-space library. We set up the library to intercept the Socket API calls [8] using LD_PRELOAD. The benefit of this approach is that we do not force the recompilation of either the MPI program or the Open MPI implementation. All socket interface calls pass through the LD_PRELOAD wrapper, through SLIM, to the underlying TCP implementation, all the while providing resilient flow implementations to BTL. In the future, we will extend the BTL implementation to interface directly with SLIM.

Figure 5.10: Incremental deployment and integration with the Open MPI Byte Transfer Layer (BTL). [Figure: three phases are shown. In Phase I, SLIM wraps the legacy BTL TCP component through an LD_PRELOAD wrapper, requiring no changes to the Open MPI codebase. In Phase II, the Open MPI codebase (btl & tcp) is patched so that BTL integrates with SLIM directly. In Phase III, the integration is incrementally deployed to include other interconnects, such as OpenIB.]

The goal is for session-based abstractions to be exposed to BTL while SLIM manages the mappings to the underlying transports. Along the same lines, we plan to use SLIM to enable resilience to transient faults for all underlying interconnects. As the next step, we plan to integrate SLIM with OpenIB alongside TCP.

5.4.5 Prototype Implementation

We discuss our contributions with reference to the prototype implementation.

Prototype for BTL TCP Component

The prototype for the BTL TCP component is implemented as a user-space library in C. The implementation comprises 3,189 lines of source code. SLIM's interface, which provides the session primitives, is exposed to BTL and is illustrated in Figure 5.10. Details of SLIM's implementation that are specific to TCP are documented in [69, 72].

Interfacing with BTL and TCP SLIM serves as a wrapper around the Socket API. As part of Phase I of our development, the Socket API calls are intercepted by SLIM through LD_PRELOAD, where SLIM maintains the session, flow, and endpoint state. These abstractions are mapped onto the underlying transport (e.g., a flow is eventually mapped to an underlying socket that serves as an input and output stream).

This indirection enables fault tolerance in communications. Consider the scenario where a transport connection times out due to a transient failure. The termination of the connection results in a failed TCP socket. SLIM recognizes the abnormality and attempts to reconnect to the destination. The assumption here is that if the fault was indeed transient, the network will be available for later communication. Spawning a new transport connection and mapping the flow onto this new transport avoids the failure of the MPI primitive that generated the communication event. Thus, the indirection allows us to catch and recover from a failure that would otherwise have caused the program to fail.

Note that SLIM recognizes failures by evaluating the error codes returned by failed reads and writes to the underlying sockets (or file descriptors).

Figure 5.11: SLIM in relation to legacy applications and those using the library. [Figure: a legacy application reaches SLIM through the LD_PRELOAD wrapper over the Socket API, while an application using the library calls SLIM directly; SLIM thus supports legacy applications while enabling greater functionality for new ones.]

As part of Phase II, we plan to extend the BTL implementation and integrate SLIM without having to use LD_PRELOAD to intercept Socket calls.

Backwards Compatibility To enable backwards compatibility with legacy TCP stacks, we use custom TCP options [71, 72]. If the peer stacks are unable to exchange the custom options during the 3-way handshake, SLIM recognizes that the peer stack does not support SLIM and subsequently falls back to legacy TCP behavior.

Prototype for BTL OpenIB Component

As part of Phase III, we plan to expand the SLIM implementation to integrate the OpenIB interconnect. The plan is to expose a uniform interface to BTL and have SLIM interact with the underlying BTL module when instantiated by MCA, be it TCP or OpenIB. The objective is a separation of session and transport semantics so that transient faults may be caught in time to allow suitable recovery (or graceful degradation), instead of having the application fail.

5.4.6 Discussion

In this section, we discuss preliminary performance results and overheads when testing SLIM in a controlled environment, as well as the concerns of deployment and interactions with the infrastructure.

Performance and Overheads

As part of the performance evaluation, we try to understand the impact that the addition of SLIM has on performance, especially latency. Since SLIM is only involved during communication setup and does not play a significant role during ongoing communications, we do not expect to see any significant overheads. The only role that SLIM plays during ongoing communications is an added level of indirection (i.e., the flow-to-socket-descriptor mapping), which should not incur a significant overhead, even for latency-sensitive systems, e.g., MPI applications.

Figures 5.12 and 5.13 summarize the latency and throughput measurements from microbenchmarks run on an unprimed configuration. We see that the SLIM implementation (analogous to a BTL, SLIM, and TCP component stack) has statistically similar performance to that of a legacy configuration with the socket API (analogous to the BTL and TCP component implementation).

Figure 5.12: Trace of average latency for BTL+TCP (Socket API) vs. BTL+SLIM+TCP (SLIM) using unprimed long-running microbenchmarks (1 Gbps link capacity, 0% loss). [Figure: average latency (µs) over 50 measurements; the Socket API and SLIM traces are statistically similar.]

Figure 5.13: Trace of average throughput for BTL+TCP (Socket API) vs. BTL+SLIM+TCP (SLIM) using long-running microbenchmarks (1 Gbps link capacity, 0% loss). [Figure: average throughput (Mbps) over 50 measurements; the Socket API and SLIM traces are statistically similar.]

The round-trip latencies for the application's point-to-point communication hover around an average of 120 µs for both SLIM and the legacy implementation. In the bandwidth tests, the point-to-point tests are able to saturate the link up to nearly 94% of the achievable link capacity. The variability in the results is due to the aggressive back-off mechanism of the TCP New Reno implementation used for these tests; newer congestion control implementations (such as TCP BBR or CUBIC) may show less variation.

Collective Operations

While the focus of our discussion has been on a shim between BTL and TCP for point-to-point communications, note that SLIM supports communication between multiple participants as part of a session. This is because the abstractions have been developed to separate session from transport semantics and therefore mitigate the limitations of underlying transport mechanisms that inhibit extensions (such as fault tolerance). However, using the multi-party session semantics of SLIM as a replacement for the COLL framework would not be efficient, because the multi-party session semantics are geared towards supporting participants in traditional networking rather than high-performance collective operations. Nevertheless, SLIM, when used as part of the BTL framework as a means for reliable data transfer, performs on par with the legacy implementation.

As we move towards Phase III of our development, we plan to include the OpenIB interconnect. There, we will study the separation of session and transport semantics and its influence on collective operations.

Incremental Deployment and Integration with Open MPI

In § 5.4.4, we discussed our development plan and summarized it in Figure 5.10. Initially, with the wrapper library, we may use LD_PRELOAD to deploy the library. This does not require rewriting or recompiling any code, whether the application or the Open MPI implementation, and thus enables incremental deployment for the BTL TCP component. However, as we move to Phases II and III, where we not only extend BTL to interface directly with SLIM but also expand SLIM to interface with OpenIB, there will be a need to recompile and deploy the updated Open MPI implementation. The applications will not require any recompilation.

Interaction with Middleboxes

Unlike traditional networks, interaction with middleboxes is not a concern here, since data center deployments are typically devoid of middleboxes between compute nodes. Nevertheless, we have demonstrated in our prior work [69, 71, 100] that SLIM is not adversely impacted by middleboxes when they are present in traditional networks.

5.4.7 Related Work

In the last decade or so, researchers have investigated various dimensions of enabling resilience in MPI programs [162, 163]. The philosophy of fault tolerance appears to have moved towards enabling users to mitigate the impact of faults and recover by trapping errors in user code and implementing suitable solutions. This is understandable, as applications have different characteristics: the impact of a fault may be severe for one application but not for another. Nevertheless, we observe that all these faults typically lie in the category of what we classify as hard faults. Handling transient faults is left entirely up to the users, who tend to use methods such as dynamic process management [171] or checkpoint and restore [166] to deal with them. While this approach yields results, we have argued that such methods are expensive for transient faults. We focus on transient network communication faults for the BTL TCP component as a case study and suggest SLIM as a suitable solution.

Below we summarize some of the notable and representative approaches that enable fault tolerance for MPI.

Fagg et al. [175] propose FT-MPI (Fault Tolerant MPI), which augments the MPI implementation and maintains additional state to determine what actions can be taken when processes in the communicator encounter failures. This changes the standard MPI semantics. For example, the MPI communicator is allowed to have states other than the original valid and invalid states, determined by the failure scenario it is experiencing. On detecting a failure in the communicator, the application can go into a failure-recovery mode that is specified by the application developer. Thus, FT-MPI allows application developers to define different failure-recovery modes beyond simple checkpointing and recovery. While FT-MPI sacrifices a great deal in terms of the time-tested semantics of standard MPI, the lessons learned have been incorporated into current Open MPI implementations.

The User Level Failure Mitigation (ULFM) interface [168] has been proposed to provide fault-tolerant semantics in MPI. The interface focuses on fail-stop failures only and allows application-level failure detection and local failure mitigation based on removing the failed processes by shrinking communicators. Laguna et al. [169] show that as processes continue to fail, the time to revoke and shrink the communicator increases linearly with an increasing number of nodes. In addition, the paper shows that the interface is only suitable for jobs that have work-decomposition flexibility, which exists, for instance, in a master-slave application model. However, for more general applications, such as bulk-synchronous MPI applications, the interface has few benefits.

MPICH-V [166] is a fault-tolerant MPI implementation designed for large clusters, where failures or disconnections between nodes are common events resulting from human error or hardware and software faults. MPICH-V adds fault tolerance by using uncoordinated checkpointing and distributed pessimistic message logging. An essential goal of the work is ease of use: old applications can run without modification, with fault tolerance that is transparent to users. However, the paper shows that MPICH-V incurs an overhead of about 23% on a job's runtime when no failures occur. Also, the uncoordinated checkpointing is not user directed; instead, it is system directed, which may result in significantly poor performance as application characteristics are not taken into account.

Dynamic process management can provide fault tolerance in MPI programs. Gropp et al. [174] show how the existing MPI specification, which usually serves as a message-passing system, can be extended to include an application interface to the system's job scheduler and process manager, or even to implement those functions if they are not already provided. The specification allows running processes to spawn new processes and communicate with them. However, developers still need to explicitly handle issues such as resource discovery, resource allocation, scheduling, profiling, and load balancing.

Another dimension of dealing with faults is to develop algorithms that are inherently fault tolerant. To help matrix factorization algorithms survive fail-stop failures during parallel execution, Du et al. [177] propose a hybrid approach based on algorithm-based fault tolerance (ABFT) that can be applied to several ubiquitous one-sided dense linear factorizations. Using LU factorization, the authors prove that this scheme successfully applies to the three well-known one-sided factorizations: Cholesky, LU, and QR. The algorithm protects both the left and right factorization results: ABFT protects the right factor with a checksum generated beforehand and carried along during the factorizations, while a scalable checkpointing method protects the left factor. However, the work does not support multiple simultaneous failures.

5.4.8 Future Work

In the future, we plan to explore the following directions of research and development:

1. In Phase II of the development, we plan to extend the BTL framework to directly interface with SLIM and integrate SLIM as a patch for the Open MPI code base.

2. In Phase III, we plan to expand SLIM to interface with other interconnects, particularly OpenIB. This would allow BTL to access the underlying interconnects with SLIM's uniform interface. This work has been of interest to interconnect vendors [183].

3. We plan to study Open MPI's COLL framework and see if we can unify COLL and SLIM's multi-party communication mechanisms to help with collective operations. This may be either in terms of semantics or in terms of implementation optimizations.

4. We also plan to investigate how we may apply the lessons learned through the SLIM and Open MPI implementations to other MPI implementations (e.g., MPICH).

5.5 Summary

Here, we've laid out an argument for enabling new paradigms of communications by using SLIM for session-based communications. We highlight the possibility of engaging middleboxes as first-class citizens, such that endpoints interact with them explicitly. To illustrate, we discuss the example of a firewall and how, with explicit interaction, we can enable robust communications. We argue that the same may be applied to other middleboxes, including Captive Portals, Network Address Translators, and Load Balancers.

We also present a case for SLIM's application in the specialized domain of high-performance computing and show how we may enable resilient communications and other extensions to the network stack.

Chapter 6

Summary and Future Work

6.1 Summary of Dissertation

In this dissertation, we raised the issue that enabling modern communication use cases with legacy network stack implementations is a significant challenge. As a consequence, we see application developers not only implementing application features, but also the supporting mechanisms to realize those features. The challenges are primarily due to the limitations of the underlying network stack implementations. The limiting assumptions made by these implementations raise a variety of challenges, including the limiting networking abstractions exposed to applications, the immutable nature of configurations, naming constraints, and limited conversation-state management.

To address these challenges, we develop a communication model rooted in session-based communications. We show that by using a session-based communication model, we can: 1) describe communications using session abstractions, enabling stacks to maintain the context of communications, in accordance with the needs of modern communications; 2) enable dynamic configuration of communications; 3) address the limitations of underlying network stack implementations (such as the coupling of session and transport semantics, limiting abstractions, and immutable configurations); 4) realize new communication paradigms with the ability to dynamically configure network stacks; and 5) do so in a manner that is backward compatible with legacy stack implementations.

6.2 Related Directions of Research

There are several avenues that may be explored in relation to session-based communications and our SLIM framework. Below we summarize select notable directions of this research.

6.2.1 Cross-Layer Communication

With reference to maintaining the context of communications, we see that network stack implementations, particularly kernel implementations, adopt a monolithic design. One of the primary reasons for this design decision is performance. However, access to the communications' context is another driver of this decision. Consider, for example, the scenario where the link state of the network interface (connected, disconnected, connected but not configured, etc.) is required by the higher layers (e.g., transport). At the moment, such information is accessed through ad hoc implementations. Consequently, if the kernel implementations do not include patches to allow a higher layer to obtain information from a lower network layer (or vice versa), there are no means by which the layers may query each other.

This brings us to questions such as: What information may be accessed from different layers? Would access to said information benefit the design of communications? What would the interface and protocols for such interactions include? And may guidelines be defined for current monolithic implementations to enable support for layers to query each other?

Answering these questions will have a significant impact on communications that involve a change in context while the conversations are taking place.

6.2.2 Expanding New Communication Paradigms

We discussed earlier how SLIM may be used to introduce new communication paradigms, such as treating middleboxes as first-class citizens and explicitly including them in communication setup. While we can present a holistic view of what such interactions may look like, we highlight that interactions with each class of middlebox will involve a specific implementation of SLIM verbs. We demonstrated the case of interactions with a firewall in setting up communications. However, there are a myriad of middleboxes that require such interactions, where current approaches rely on ad hoc implementations. Captive portals are one such example. Similarly, explicit interaction with other middleboxes, such as application accelerators, proxies, and load balancers, can also enable much more robust communications.

6.2.3 Policy Management and Enforcement

When it comes to communication setup, policy management and enforcement have typically been done through ad hoc means. Network architects and system administrators painstakingly work through the organization's policy and manually translate it into methods by which it may be enforced. This is particularly relevant to Science DMZ [?] use cases. Consider, for example, the case where network traffic from a research lab is to be directed through the National Research and Education Networks (NRENs), in contrast with other traffic from the university campus that exits through the Internet Service Provider. The routes for such a configuration are set up manually. However, if the endpoints were to engage the middleboxes in setting up communications, and were there an interface by which the network architects could define the organization's policy, we could envision a much more sophisticated network configuration that is dynamically instantiated. A simple example of the same may be envisioned in relation to Captive Portals in a hotel that manages the guests' network access.

Here, the following questions, among others, would need to be answered: How would the policy be defined? What would be the vocabulary for such definitions, such that it is sufficiently generic? How would such an interface be implemented? How would the implementation interact with middleboxes and network elements? And how would policy-enforcement implementations leverage existing efforts such as OpenFlow and OpenDataPlane?

6.2.4 SLIM’s Application in Specialized Domains

As we shared earlier, SLIM's ability to enable dynamic configuration opens the door to robust communications where context changes during conversations. In this light, SLIM's application to a variety of specialized domains may enable features such as fault tolerance and multi-party communications. Moreover, the use of session-based abstractions would lead towards simpler designs and implementations. Specialized domains that require adaptation to changing context would benefit from the support mechanisms that SLIM provides. These domains include Autonomous Systems and Unmanned Aerial Vehicles (which temporarily join a communication session with multiple participants and later leave), High-Performance Computing (which requires collective communications between multiple participants), the Internet of Things (devices that communicate in an ad hoc and possibly mobile network with changing context), and Enterprise & Cloud Computing Environments (which use containers to sandbox compute elements and must update network state when the virtual environments require configuration updates).

Bibliography

[1] J. Postel, “Transmission Control Protocol,” RFC 793 (INTERNET STANDARD), Internet Engineering Task Force, Sep. 1981, updated by RFCs 1122, 3168, 6093, 6528. [Online]. Available: http://www.ietf.org/rfc/rfc793.txt

[2] C. Labovitz, “Internet Traffic Trends,” NANOG 43, 2008, North American Network Operators’ Group. [Online]. Available: https://www.nanog.org/meetings/nanog43/presentations/Labovitz_internetstats_N43.pdf

[3] V. G. Cerf and M. P. Singh, “Internet Predictions: Future Imperfect,” IEEE Internet Computing, vol. 14, no. 1, pp. 10–11, Jan. 2010. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5370817

[4] “New Features in Android OS 5.0,” 2015. [Online]. Available: http://android-developers.blogspot.com/2014/10/whats-new-in-android-50-lollipop.html

[5] “iOS and OSX Handoff,” 2015. [Online]. Available: https://developer.apple.com/handoff/

[6] J. Rexford and C. Dovrolis, “Future Internet architecture,” Communications of the ACM, vol. 53, no. 9, Sep. 2010. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1810891.1810906

[7] A. Medina, M. Allman, and S. Floyd, “Measuring the evolution of transport protocols in the internet,” ACM SIGCOMM Computer Communication Review, vol. 35, no. 2, p. 37, Apr. 2005. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1064413.1064418

[8] “POSIX.1-2008 Specification,” 2015. [Online]. Available: http://pubs.opengroup.org/onlinepubs/9699919799/functions/contents.html

[9] B. Ford and J. Iyengar, “Breaking up the Transport Logjam,” in ACM HotNets-VII, 2008. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153.9650


[10] ——, “Efficient Cross-Layer Negotiation,” in ACM HotNets-VIII, 2009. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.148.320&rep=rep1&type=pdf

[11] A. Habib, N. Christin, and J. Chuang, “Taking advantage of multihoming with session layer striping,” in IEEE International Conference on Computer Communications (INFOCOM). IEEE, 2006, pp. 1–6. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4146707

[12] J. Salz, A. C. Snoeren, and H. Balakrishnan, “TESLA: A Transparent, Extensible Session-Layer Architecture for End-to-End Network Services,” in Symposium on Internet Technologies and Systems. USENIX, 2003.

[13] C. Dovrolis, “What would Darwin think about clean-slate architectures?” ACM SIGCOMM Computer Communication Review, vol. 38, no. 1, p. 29, Jan. 2008. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1341431.1341436

[14] E. Kohler, “The Click Modular Router,” Ph.D. dissertation, Laboratory for Computer Science, MIT, February 2001.

[15] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “OpenFlow: Enabling Innovation in Campus Networks,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, pp. 69–74, Mar. 2008. [Online]. Available: http://doi.acm.org/10.1145/1355734.1355746

[16] K. Yap, M. Kobayashi, D. Underhill, et al., “The Stanford OpenRoads Deployment,” in ACM International Workshop on Experimental Evaluation and Characterization, ser. WINTECH ’09, 2009, pp. 59–66. [Online]. Available: http://portal.acm.org/citation.cfm?id=1614293.1614304

[17] C. Dovrolis and J. Streelman, “Evolvable network architectures: What can we learn from biology?” ACM SIGCOMM Computer Communication Review, vol. 40, no. 2, pp. 72–77, 2010. [Online]. Available: http://portal.acm.org/citation.cfm?id=1764886

[18] B. Ford, “Structured Streams: A New Transport Abstraction,” in ACM SIGCOMM Computer Communication Review, vol. 37, no. 4. ACM, 2007, pp. 361–372.

[19] T. Mahieu, P. Verbaeten, and W. Joosen, “A Session Layer Concept for Overlay Networks,” Wireless Personal Communications, vol. 35, no. 1-2, pp. 111–121, Oct. 2005. [Online]. Available: http://www.springerlink.com/content/y0276t1822v81581

[20] A. Snoeren and H. Balakrishnan, “An end-to-end approach to host mobility,” in ACM MobiCom. ACM, 2000, pp. 155–166. [Online]. Available: http://portal.acm.org/citation.cfm?doid=345910.345938

[21] A. C. Snoeren, H. Balakrishnan, and M. F. Kaashoek, “Reconsidering Internet Mobility,” HotOS-VIII, 2001. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.3779

[22] A. C. Snoeren, “A Session-Based Architecture for Internet Mobility,” Massachusetts Institute of Technology, Tech. Rep., 2003. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.7707

[23] A. Brown, M. Swany, E. Kissel, and G. Almes, “Phoebus: A Session Protocol for Dynamic and Heterogeneous Networks,” University of Delaware, Newark, Tech. Rep., 2008. [Online]. Available: damsl.cis.udel.edu/projects/phoebus/phoebus_tech_report.pdf

[24] E. Kissel, M. Swany, and A. Brown, “Improving GridFTP performance using the Phoebus session layer,” High Performance Networking and Computing, 2009. [Online]. Available: http://portal.acm.org/citation.cfm?id=1654059.1654094

[25] B. Landfeldt, T. Larsson, Y. Ismailov, and A. Seneviratne, “SLM, A Framework for Session Layer Mobility Management,” ICCCN, 1999. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.68.6621

[26] E. Nordström, D. Shue, P. Gopalan, R. Kiefer, M. Arye, S. Y. Ko, J. Rexford, and M. J. Freedman, “Serval: An End-host Stack for Service-centric Networking,” in USENIX NSDI, ser. NSDI’12. USENIX Association, 2012.

[27] R. Stewart, “Stream Control Transmission Protocol,” RFC 4960 (Proposed Standard), Internet Engineering Task Force, Sep. 2007, updated by RFCs 6096, 6335, 7053. [Online]. Available: http://www.ietf.org/rfc/rfc4960.txt

[28] H. Han, S. Shakkottai, C. V. Hollot, R. Srikant, and D. Towsley, “Multi-Path TCP: A Joint Congestion Control and Routing Scheme to Exploit Path Diversity in the Internet,” IEEE/ACM Transactions on Networking, vol. 14, no. 6, pp. 1260–1271, Dec. 2006. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4032726

[29] D. Wischik, C. Raiciu, A. Greenhalgh, and M. Handley, “Design, Implementation and Evaluation of Congestion Control for Multipath TCP,” in 8th USENIX Conference on Networked Systems Design and Implementation (NSDI). USENIX Association, 2011.

[30] C. Paasch and O. Bonaventure, “Multipath TCP,” ACM Communications, vol. 57, no. 4, pp. 51–57, Apr. 2014. [Online]. Available: http://doi.acm.org/10.1145/2578901

[31] D. Wischik, M. Handley, and C. Raiciu, “Control of multipath TCP and optimization of multipath routing in the Internet,” in Network Control and Optimization. Springer Verlag, 2009, pp. 204–218. [Online]. Available: http://www.springerlink.com/index/4368QH37W22195Q0.pdf

[32] D. Wischik, M. Handley, and M. Braun, “The resource pooling principle,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 5, pp. 47–52, Sep. 2008. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1452335.1452342

[33] H. Balakrishnan, H. S. Rahul, and S. Seshan, “An Integrated Congestion Management Architecture for Internet Hosts,” in ACM SIGCOMM, 1999.

[34] J. Iyengar and B. Ford, “A Next Generation Transport Services Architecture,” Internet-Draft (Informational), Work in Progress, Internet Engineering Task Force, 2009. [Online]. Available: http://tools.ietf.org/html/draft-iyengar-ford-tng-00

[35] ——, “Flow Splitting with Fate Sharing in a Next Generation Transport Services Architecture,” Yale University, Tech. Rep., 2009. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.5689&rep=rep1&type=pdf

[36] R. Atkinson, S. Bhatti, and S. Hailes, “Mobility through naming: impact on DNS,” in MobiArch. ACM, 2008, pp. 7–12. [Online]. Available: http://portal.acm.org/citation.cfm?id=1403010

[37] R. Moskowitz and P. Nikander, “Host Identity Protocol (HIP) Architecture,” RFC 4423 (Informational), Internet Engineering Task Force, May 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4423.txt

[38] R. Moskowitz, P. Nikander, P. Jokela, and T. Henderson, “Host Identity Protocol,” RFC 5201 (Experimental), Internet Engineering Task Force, Apr. 2008, obsoleted by RFC 7401, updated by RFC 6253. [Online]. Available: http://www.ietf.org/rfc/rfc5201.txt

[39] R. Moskowitz, T. Heer, P. Jokela, and T. Henderson, “Host Identity Protocol Version 2 (HIPv2),” RFC 7401 (Proposed Standard), Internet Engineering Task Force, Apr. 2015. [Online]. Available: http://www.ietf.org/rfc/rfc7401.txt

[40] D. A. Maltz and P. Bhagwat, “MSOCKS: An Architecture for Transport Layer Mobility,” in IEEE International Conference on Computer Communications (INFOCOM). IEEE, 1998, pp. 1037–1045. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.5244

[41] M. Stiemerling, J. Quittek, and T. Taylor, “Middlebox Communication (MIDCOM) Protocol Semantics,” RFC 5189 (Proposed Standard), Internet Engineering Task Force, Mar. 2008. [Online]. Available: http://www.ietf.org/rfc/rfc5189.txt

[42] P. Srisuresh, J. Kuthan, J. Rosenberg, A. Molitor, and A. Rayhan, “Middlebox communication architecture and framework,” RFC 3303 (Informational), Internet Engineering Task Force, Aug. 2002. [Online]. Available: http://www.ietf.org/rfc/rfc3303.txt

[43] R. P. Swale, P. A. Mart, P. Sijben, S. Brim, and M. Shore, “Middlebox Communications (midcom) Protocol Requirements,” RFC 3304 (Informational), Internet Engineering Task Force, Aug. 2002. [Online]. Available: http://www.ietf.org/rfc/rfc3304.txt

[44] B. Carpenter and S. Brim, “Middleboxes: Taxonomy and Issues,” RFC 3234 (Informational), Internet Engineering Task Force, Feb. 2002. [Online]. Available: http://www.ietf.org/rfc/rfc3234.txt

[45] M. Stiemerling, J. Quittek, and T. Taylor, “Middlebox Communications (MIDCOM) Protocol Semantics,” RFC 3989 (Informational), Internet Engineering Task Force, Feb. 2005, obsoleted by RFC 5189. [Online]. Available: http://www.ietf.org/rfc/rfc3989.txt

[46] T. Anderson, L. Peterson, and S. Shenker, “Overcoming the Internet impasse through virtualization,” in ACM HotNets-III, 2004. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1432642

[47] M. Chiang, S. Low, A. Calderbank, and J. Doyle, “Layering As Optimization Decomposition: Current Status and Open Issues,” in Conference on Information Sciences and Systems. IEEE, Mar. 2006, pp. 355–362. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4067833

[48] ——, “Layering as optimization decomposition: A mathematical theory of network architectures,” Proceedings of the IEEE, vol. 95, no. 1, pp. 255–312, Jan. 2007. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4118456

[49] I. Baldine, M. Vellala, A. Wang, G. Rouskas, R. Dutta, and D. Stevenson, “A Unified Software Architecture to Enable Cross-Layer Design in the Future Internet,” in ICCCN. IEEE, Aug. 2007, pp. 26–32. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4317792

[50] S. Jordan, “A layered network approach to net neutrality,” International Journal of Communication, vol. 1, pp. 427–460, 2007. [Online]. Available: http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:A+Layered+Network+Approach+to+Net+Neutrality#0

[51] J. Day, I. Matta, and K. Mattar, “Networking is IPC: a guiding principle to a better Internet,” in ACM CoNEXT, 2008. [Online]. Available: http://portal.acm.org/citation.cfm?id=1544079

[52] M. Sifalakis, A. Louca, and A. Mauthe, “A functional composition framework for autonomic network architectures,” in IEEE NOMS. IEEE, Apr. 2008, pp. 328–334. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4509967

[53] J. Touch and V. Pingali, “The RNA metaprotocol,” in ICCCN. IEEE, Aug. 2008, pp. 1–6. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4674206

[54] J. Touch, I. Baldine, R. Dutta, G. G. Finn, B. Ford, S. Jordan, D. Massey, A. Matta, C. Papadopoulos, and P. Reiher, “A Dynamic Recursive Unified Internet Design (DRUID),” Computer Networks, vol. 55, no. 4, pp. 919–935, Mar. 2011. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S138912861000383X

[55] F. Teraoka, “Redesigning Layered Network Architecture for Next Generation Networks,” in IEEE GLOBECOM, Nov. 2009, pp. 1–6. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5360742

[56] G. Bouabene, C. Jelger, C. Tschudin, S. Schmid, A. Keller, and M. May, “The autonomic network architecture (ANA),” IEEE Journal on Selected Areas in Communications, vol. 28, no. 1, pp. 4–14, Jan. 2010. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5371088

[57] V. Ishakian, J. Akinwumi, F. Esposito, and I. Matta, “On Supporting Mobility and Multihoming in Recursive Internet Architectures,” in FutureNet III, 2010.

[58] C. Perkins, “IP Mobility Support for IPv4,” RFC 3220 (Proposed Standard), Internet Engineering Task Force, Jan. 2002, obsoleted by RFC 3344. [Online]. Available: http://www.ietf.org/rfc/rfc3220.txt

[59] M. Atiquzzaman and A. Reaz, “Survey and Classification of Transport Layer Mobility Management Schemes,” in IEEE PIMRC, vol. 4. IEEE, 2005, pp. 2109–2115. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1651818

[60] D. Joseph and I. Stoica, “Modeling middleboxes,” IEEE Network, vol. 22, no. 5, pp. 20–25, Sep. 2008. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=4626228

[61] R. Hancock, G. Karagiannis, J. Loughney, and S. Van den Bosch, “Next Steps in Signaling (NSIS): Framework,” RFC 4080 (Informational), Internet Engineering Task Force, Jun. 2005. [Online]. Available: http://www.ietf.org/rfc/rfc4080.txt

[62] B. Y. K. Srinivasan, “MTCP: Transport Layer Support for Highly Available Network Ser- vices,” Master’s thesis, Rutgers, The State University of New Jersey, 2001.

[63] M. Walfish, J. Stribling, M. Krohn, H. Balakrishnan, R. Morris, and S. Shenker, “Middleboxes no longer considered harmful,” in USENIX OSDI, 2004, pp. 15–15. [Online]. Available: http://portal.acm.org/citation.cfm?id=1251269

[64] J. Rosenberg, R. Mahy, P. Matthews, and D. Wing, “Session Traversal Utilities for NAT (STUN),” RFC 5389 (Proposed Standard), Internet Engineering Task Force, Oct. 2008, updated by RFC 7350. [Online]. Available: http://www.ietf.org/rfc/rfc5389.txt

[65] Google, “SPDY: An experimental protocol for a faster web - The Chromium Projects,” 2011. [Online]. Available: http://www.chromium.org/spdy/spdy-whitepaper

[66] W. Arthur, “Competing Technologies, Increasing Returns, and lock-in by historical Events,” The Economic Journal, vol. 99, no. 394, pp. 116–131, 1989. [Online]. Available: http://www.jstor.org/stable/2234208

[67] M. Honda, Y. Nishida, C. Raiciu, A. Greenhalgh, M. Handley, and H. Tokuda, “Is it Still Possible to Extend TCP?” in Internet Measurement Conference. ACM SIGCOMM, 2011, pp. 181–194.

[68] J. H. Saltzer, D. P. Reed, and D. D. Clark, “End-to-end arguments in system design,” ACM Transactions on Computer Systems, vol. 2, no. 4, pp. 277–288, Nov. 1984. [Online]. Available: http://portal.acm.org/citation.cfm?doid=357401.357402

[69] U. Kalim, M. Gardner, E. Brown, and W. Feng, “SLIM: Enabling Transparent Extensibility and Dynamic Configuration via Session-Layer Abstractions,” in ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2017.

[70] ——, “Cascaded TCP: Big Throughput for Big Data Applications in Distributed HPC,” in IEEE/ACM Supercomputing, 2012, Poster.

[71] ——, “Cascaded TCP: Applying Pipelining to TCP for Efficient Communication over Wide-Area Networks,” in IEEE GLOBECOM, December 2013.

[72] U. Kalim, E. Brown, M. Gardner, and W. Feng, “Enabling Renewed Innovation in TCP by Establishing an Isolation Boundary,” in 8th International Workshop on Protocols for Future, Large-Scale and Diverse Network Transports (PFLDNeT), November 2010.

[73] E. Brown, M. Gardner, U. Kalim, and W. Feng, “Restoring End-to-End Resilience in the Presence of Middleboxes,” in International Conference on Computer Communications and Networking (ICCCN), Maui, Hawaii, August 2011.

[74] U. Kalim, M. Gardner, E. Brown, and W. Feng, “Seamless Migration of Virtual Machines Across Networks,” in International Conference on Computer Communications and Networks, August 2013.

[75] ——, “Cognizant Networks: A Model for Session-Based Communications and Adaptive Networking,” IEEE Transactions on Networking, 2017, submitted for review.

[76] U. Kalim, M. Gardner, and W. Feng, “A Non-Invasive Approach for Realizing Resilience in MPI,” in Fault Tolerance for HPC at eXtreme Scale (FTXS), held in conjunction with the 26th ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), June 2017.

[77] Google, “QUIC: Quick UDP Internet Connections,” 2012. [Online]. Available: https://www.chromium.org/quic

[78] T. W. Curry, “Profiling and Tracing Dynamic Library Usage Via Interposition,” in USENIX Summer Technical Conference. USENIX Association, 1994, pp. 267–278. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.8794

[79] H. Zimmermann, “OSI Reference Model — The ISO Model of Architecture for Open Systems Interconnection,” IEEE Transactions on Communications, vol. 28, no. 4, pp. 425–432, Apr. 1980. [Online]. Available: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1094702

[80] A. L. Russell, “OSI: The Internet that Wasn’t — How TCP/IP eclipsed the Open Systems Interconnection standards to become the global protocol for computer networking,” IEEE Spectrum, Jul. 2013. [Online]. Available: http://spectrum.ieee.org/computing/networks/osi-the-internet-that-wasnt

[81] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Handley, and E. Schooler, “SIP: Session Initiation Protocol,” RFC 3261 (Proposed Standard), Internet Engineering Task Force, Jun. 2002, updated by RFCs 3265, 3853, 4320, 4916, 5393, 5621, 5626, 5630, 5922, 5954, 6026, 6141, 6665, 6878, 7462, 7463. [Online]. Available: http://www.ietf.org/rfc/rfc3261.txt

[82] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications,” RFC 3550 (INTERNET STANDARD), Internet Engineering Task Force, Jul. 2003, updated by RFCs 5506, 5761, 6051, 6222, 7022, 7160, 7164. [Online]. Available: http://www.ietf.org/rfc/rfc3550.txt

[83] C. Raiciu, M. Handley, and A. Ford, “Multipath TCP Design Decisions,” Tech. Rep., Jul. 2009.

[84] A. Ford, C. Raiciu, M. Handley, S. Barre, and J. Iyengar, “Architectural Guidelines for Multipath TCP Development,” RFC 6182 (Informational), Internet Engineering Task Force, Mar. 2011. [Online]. Available: http://www.ietf.org/rfc/rfc6182.txt

[85] A. Ford, C. Raiciu, M. Handley, and O. Bonaventure, “TCP Extensions for Multipath Operation with Multiple Addresses,” RFC 6824 (Experimental), Internet Engineering Task Force, Jan. 2013. [Online]. Available: http://www.ietf.org/rfc/rfc6824.txt

[86] “iOS: Multipath TCP Support in iOS 7,” http://support.apple.com/en-us/HT201373, 2015.

[87] H. Balakrishnan and S. Seshan, “The Congestion Manager,” RFC 3124 (Proposed Standard), Internet Engineering Task Force, Jun. 2001. [Online]. Available: http://www.ietf.org/rfc/rfc3124.txt

[88] C. Perkins, “IP Mobility Support for IPv4, Revised,” RFC 5944 (Proposed Standard), Internet Engineering Task Force, Nov. 2010. [Online]. Available: http://www.ietf.org/rfc/rfc5944.txt

[89] C. Perkins, D. Johnson, and J. Arkko, “Mobility Support in IPv6,” RFC 6275 (Proposed Standard), Internet Engineering Task Force, Jul. 2011. [Online]. Available: http://www.ietf.org/rfc/rfc6275.txt

[90] I. Stoica, D. Adkins, S. Zhuang, S. Shenker, and S. Surana, “Internet indirection infrastructure,” IEEE/ACM Transactions on Networking, vol. 12, no. 2, pp. 205–218, April 2004.

[91] “iOS: Multipath TCP Support in iOS 7,” 2015. [Online]. Available: https://support.apple.com/en-us/HT201373

[92] G. R. Wright and W. R. Stevens, TCP/IP Illustrated, Volume 2: The Implementation, 1st ed. Addison-Wesley Professional, 1995.

[93] D. Cooper, S. Santesson, S. Farrell, S. Boeyen, R. Housley, and W. Polk, “Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile,” RFC 5280 (Proposed Standard), Internet Engineering Task Force, May 2008, updated by RFC 6818. [Online]. Available: http://www.ietf.org/rfc/rfc5280.txt

[94] P. Vixie, S. Thomson, Y. Rekhter, and J. Bound, “Dynamic Updates in the Domain Name System (DNS UPDATE),” RFC 2136 (Proposed Standard), Internet Engineering Task Force, Apr. 1997, updated by RFCs 3007, 4035, 4033, 4034. [Online]. Available: http://www.ietf.org/rfc/rfc2136.txt

[95] P. Mockapetris, “Domain names - implementation and specification,” RFC 1035 (INTERNET STANDARD), Internet Engineering Task Force, Nov. 1987, updated by RFCs 1101, 1183, 1348, 1876, 1982, 1995, 1996, 2065, 2136, 2181, 2137, 2308, 2535, 2673, 2845, 3425, 3658, 4033, 4034, 4035, 4343, 5936, 5966, 6604. [Online]. Available: http://www.ietf.org/rfc/rfc1035.txt

[96] J. Sermersheim, “Lightweight Directory Access Protocol (LDAP): The Protocol,” RFC 4511 (Proposed Standard), Internet Engineering Task Force, Jun. 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4511.txt

[97] A. S. Tanenbaum and M. van Steen, Distributed Systems: Principles and Paradigms, 2nd ed. Prentice Hall, 2006.

[98] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications,” in ACM SIGCOMM, 2001.

[99] U. Kalim, M. Gardner, E. Brown, and W. Feng, “SLIM: A Session-Layer Intermediary for Enabling Multi-Party and Reconfigurable Communication,” http://hdl.handle.net/10919/52933, Department of Computer Science, Virginia Tech, Tech. Rep. TR-15-04, June 2015.

[100] “Demonstration Video of Seamless Virtual Machine Migration,” 2011. [Online]. Available: http://www.cs.vt.edu/~umar/vm-demo

[101] “Introduction to Mellanox Infiniband Architecture,” 2017. [Online]. Available: http://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf

[102] S. Park, H. Park, Y. Won, J. Lee, and S. Kent, “Traceable Anonymous Certificate,” RFC 5636 (Experimental), Internet Engineering Task Force, Aug. 2009. [Online]. Available: http://www.ietf.org/rfc/rfc5636.txt

[103] L. Rizzo, “Dummynet,” Dipartimento di Ingegneria dell’Informazione, Università di Pisa, Italy. [Online]. Available: http://info.iet.unipi.it/~luigi/dummynet/

[104] L. Cottrell, “Ping End-to-End Reporting.” [Online]. Available: http://www-iepm.slac.stanford.edu/pinger/

[105] M. Mathis, J. Semke, J. Mahdavi, and T. Ott, “The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm,” SIGCOMM Computer Communication Review, vol. 27, no. 3, pp. 67–82, 1997.

[106] S. Floyd, T. Henderson, and A. Gurtov, “The NewReno Modification to TCP’s Fast Recovery Algorithm,” RFC 3782 (Proposed Standard), Internet Engineering Task Force, Apr. 2004, obsoleted by RFC 6582. [Online]. Available: http://www.ietf.org/rfc/rfc3782.txt

[107] E. Nygren, R. K. Sitaraman, and J. Sun, “The Akamai Network: A Platform for High-Performance Internet Applications,” SIGOPS Operating Systems Review, vol. 44, no. 3, 2010.

[108] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, and I. Foster, “The Globus Striped GridFTP Framework and Server,” in ACM/IEEE Supercomputing, 2005.

[109] E. Weigle and W. Feng, “A Comparison of TCP Automatic Tuning Techniques for Distributed Computing,” in IEEE International Symposium on High Performance Distributed Computing, 2002, pp. 265–272.

[110] J. Border, M. Kojo, J. Griner, G. Montenegro, and Z. Shelby, “Performance Enhancing Proxies Intended to Mitigate Link-Related Degradations,” RFC 3135 (Informational), Internet Engineering Task Force, Jun. 2001. [Online]. Available: http://www.ietf.org/rfc/rfc3135.txt

[111] H. Y. Pucha and C. Hu, “Slot: Shortened Loop Internet Transport using Overlay Networks,” Purdue University, Tech. Rep. TR-ECE-5-12, 2005. [Online]. Available: http://docs.lib.purdue.edu/ecetr/66

[112] L. Xu, K. Harfoush, and I. Rhee, “Binary Increase Congestion Control (BIC) for Fast Long-Distance Networks,” in IEEE INFOCOM, vol. 4. IEEE, 2004, pp. 2514–2524.

[113] S. Ha, I. Rhee, and L. Xu, “CUBIC: A New TCP-Friendly High-Speed TCP Variant,” SIGOPS Operating Systems Review, vol. 42, no. 5, pp. 64–74, 2008.

[114] R. Shorten and D. Leith, “H-TCP: TCP for High-Speed and Long-Distance Networks.” in PFLDnet, 2004.

[115] K. Tan, J. Song, Q. Zhang, and M. Sridharan, “A Compound TCP Approach for High-Speed and Long Distance Networks,” Microsoft Research, Tech. Rep. MSR-TR-2005-86, 2005.

[116] S. Ha, L. Le, I. Rhee, and L. Xu, “Impact of Background Traffic on Performance of High-Speed TCP Variant Protocols,” Computer Networks, vol. 51, no. 7, pp. 1748–1762, 2007.

[117] S. Bhatti and M. Bateman, “Transport Protocol Throughput Fairness,” Journal of Net- works, vol. 4, no. 9, 2009.

[118] W. Feng, “Long-Haul TCP vs. Cascaded TCP,” Computer Science, Virginia Tech, Tech. Rep. TR-06-04, 2006. [Online]. Available: http://eprints.cs.vt.edu/archive/00000737/

[119] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose, “Modeling TCP Throughput: A Simple Model and its Empirical Validation,” in SIGCOMM. ACM, 1998, pp. 303–314.

[120] A. Tirumala, M. Gates, F. Qin, and J. Dugan, “Iperf - The TCP/UDP Bandwidth Measurement Tool,” National Laboratory for Applied Network Research. [Online]. Available: http://dast.nlanr.net/Projects/Iperf

[121] G. Giacobbi, “Netcat - The TCP/IP Swiss Army Knife.” [Online]. Available: http://nc110.sourceforge.net/

[122] “PlanetLab,” Princeton University. [Online]. Available: http://www.planet-lab.org

[123] A. Ford, C. Raiciu, M. Handley, and S. Barre, “TCP Extensions for Multipath Operation with Multiple Addresses,” Internet-Draft (Experimental), Work in Progress, Internet Engineering Task Force, Jul. 2010.

[124] J. Iyengar and B. Ford, “A Next Generation Transport Services Architecture,” Internet-Draft (Informational), Work in Progress, Internet Engineering Task Force, Jul. 2009.

[125] A. Greenberg, G. Hjalmtysson, D. A. Maltz, A. Myers et al., “A Clean Slate 4D Approach to Network Control and Management,” in ACM SIGCOMM CCR, vol. 35, no. 5. ACM, 2005, pp. 41–54.

[126] V. Jacobson, R. Braden, and D. Borman, “TCP Extensions for High Performance,” RFC 1323 (Proposed Standard), Internet Engineering Task Force, May 1992, obsoleted by RFC 7323. [Online]. Available: http://www.ietf.org/rfc/rfc1323.txt

[127] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, “TCP Selective Acknowledgment Options,” RFC 2018 (Proposed Standard), Internet Engineering Task Force, Oct. 1996. [Online]. Available: http://www.ietf.org/rfc/rfc2018.txt

[128] J. Zweig and C. Partridge, “TCP alternate checksum options,” RFC 1146 (Historic), Internet Engineering Task Force, Mar. 1990, obsoleted by RFC 6247. [Online]. Available: http://www.ietf.org/rfc/rfc1146.txt

[129] T. Connolly, P. Amer, and P. Conrad, “An Extension to TCP: Partial Order Service,” RFC 1693 (Historic), Internet Engineering Task Force, Nov. 1994, obsoleted by RFC 6247. [Online]. Available: http://www.ietf.org/rfc/rfc1693.txt

[130] M. Duke, R. Braden, W. Eddy, and E. Blanton, “A Roadmap for Transmission Control Protocol (TCP) Specification Documents,” RFC 4614 (Informational), Internet Engineering Task Force, Sep. 2006, obsoleted by RFC 7414, updated by RFC 6247. [Online]. Available: http://www.ietf.org/rfc/rfc4614.txt

[131] R. Braden, “T/TCP – TCP Extensions for Transactions Functional Specification,” RFC 1644 (Historic), Internet Engineering Task Force, Jul. 1994, obsoleted by RFC 6247. [Online]. Available: http://www.ietf.org/rfc/rfc1644.txt

[132] A. Heffernan, “Protection of BGP Sessions via the TCP MD5 Signature Option,” RFC 2385 (Proposed Standard), Internet Engineering Task Force, Aug. 1998, obsoleted by RFC 5925, updated by RFC 6691. [Online]. Available: http://www.ietf.org/rfc/rfc2385.txt

[133] J. Touch, A. Mankin, and R. Bonica, “The TCP Authentication Option,” RFC 5925 (Proposed Standard), Internet Engineering Task Force, Jun. 2010. [Online]. Available: http://www.ietf.org/rfc/rfc5925.txt

[134] S. Floyd, M. Allman, A. Jain, and P. Sarolahti, “Quick-Start for TCP and IP,” RFC 4782 (Experimental), Internet Engineering Task Force, Jan. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc4782.txt

[135] D. J. Bernstein, “SYN cookies,” February 2002. [Online]. Available: http://cr.yp.to/ syncookies.html

[136] T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif, “Black-Box and Gray-Box Strategies for Virtual Machine Migration,” in USENIX Conference on Networked Systems Design and Implementation (NSDI), 2007.

[137] V. Shrivastava, P. Zerfos, K.-W. Lee, H. Jamjoom, Y.-H. Liu, and S. Banerjee, “Application-Aware Virtual Machine Migration in Data Centers,” in IEEE INFOCOM, 2011.

[138] R. Bradford, E. Kotsovinos, A. Feldmann, and H. Schiöberg, “Live Wide-Area Migration of Virtual Machines Including Local Persistent State,” in SIGPLAN/SIGOPS 3rd International Conference on Virtual Execution Environments. ACM, 2007, pp. 169–179.

[139] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, “Live Migration of Virtual Machines,” in USENIX Conference on Networked Systems Design and Implementation. USENIX, 2005, pp. 273–286.

[140] D. Erickson, G. Gibb, B. Heller, D. Underhill, J. Naous, G. Appenzeller, G. Parulkar, N. McKeown, M. Rosenblum, and L. Monica, “A Demonstration of Virtual Machine Mobility in an OpenFlow Network,” in SIGCOMM (Demo). ACM, 2008. [Online]. Available: http://conferences.sigcomm.org/sigcomm/2008/papers/p513-ericksonA.pdf

[141] W. Voorsluys, J. Broberg, S. Venugopal, and R. Buyya, “Cost of Virtual Machine Live Migration in Clouds: A Performance Evaluation,” in CloudCom. Springer-Verlag, 2009, pp. 254–265.

[142] M. Mahalingam, D. Dutt, K. Duda, P. Agarwal, L. Kreeger, T. Sridhar, M. Bursell, and C. Wright, “VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,” Internet Draft (Work in Progress), 2013. [Online]. Available: https://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02

[143] D. Eastlake, A. Banerjee, D. Dutt, R. Perlman, and A. Ghanwani, “Transparent Interconnection of Lots of Links (TRILL) Use of IS-IS,” RFC 6326 (Proposed Standard), Internet Engineering Task Force, Jul. 2011, obsoleted by RFC 7176. [Online]. Available: http://www.ietf.org/rfc/rfc6326.txt

[144] K. Kompella and Y. Rekhter, “Virtual Private LAN Service (VPLS) Using BGP for Auto-Discovery and Signaling,” RFC 4761 (Proposed Standard), Internet Engineering Task Force, Jan. 2007, updated by RFC 5462. [Online]. Available: http://www.ietf.org/rfc/rfc4761.txt

[145] R. Finlayson, T. Mann, J. Mogul, and M. Theimer, “A Reverse Address Resolution Protocol,” RFC 903 (INTERNET STANDARD), Internet Engineering Task Force, Jun. 1984. [Online]. Available: http://www.ietf.org/rfc/rfc903.txt

[146] C. Perkins, “IP Encapsulation within IP,” RFC 2003 (Proposed Standard), Internet Engineering Task Force, Oct. 1996, updated by RFCs 3168, 6864. [Online]. Available: http://www.ietf.org/rfc/rfc2003.txt

[147] “TRILL in the Data Center: Look Before You Leap,” White Paper, Juniper Networks, 2012. [Online]. Available: http://www.juniper.net/us/en/local/pdf/whitepapers/2000408-en.pdf

[148] B. Wellington, “Secure Domain Name System (DNS) Dynamic Update,” RFC 3007 (Proposed Standard), Internet Engineering Task Force, Nov. 2000. [Online]. Available: http://www.ietf.org/rfc/rfc3007.txt

[149] C. Constable, Personal communication, Juniper Networks, May 2011.

[150] A. C. Snoeren and H. Balakrishnan, “An End-to-End Approach to Host Mobility,” in Proceedings of the 6th Annual International Conference on Mobile Computing and Networking. ACM, 2000, pp. 155–166.

[151] D. Joseph and I. Stoica, “Modeling Middleboxes,” IEEE Network, vol. 22, no. 5, pp. 20–25, 2008.

[152] D. A. Joseph, A. Tavakoli, and I. Stoica, “A Policy-Aware Switching Layer for Data Centers,” in Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, ser. SIGCOMM ’08. ACM, 2008, pp. 51–62.

[153] R. Mahy, P. Matthews, and J. Rosenberg, “Traversal Using Relays around NAT (TURN): Relay Extensions to Session Traversal Utilities for NAT (STUN),” RFC 5766 (Proposed Standard), Internet Engineering Task Force, Apr. 2010. [Online]. Available: http://www.ietf.org/rfc/rfc5766.txt

[154] J. Rosenberg, “Interactive Connectivity Establishment (ICE): A Protocol for Network Address Translator (NAT) Traversal for Offer/Answer Protocols,” RFC 5245 (Proposed Standard), Internet Engineering Task Force, Apr. 2010, updated by RFC 6336. [Online]. Available: http://www.ietf.org/rfc/rfc5245.txt

[155] F. Sultan, K. Srinivasan, D. Iyer, and L. Iftode, “Migratory TCP: Connection Migration for Service Continuity in the Internet,” in Proceedings of the 22nd International Conference on Distributed Computing Systems, 2002, pp. 469–470.

[156] C. Dixon, A. Krishnamurthy, and T. Anderson, “An End to the Middle,” in USENIX HotOS XII, 2009.

[157] “WX Series Application Acceleration Platforms,” 2015. [Online]. Available: https: //www.juniper.net/us/en/local/pdf/datasheets/1000112-en.pdf

[158] “Cisco Catalyst 6500 Series SSL Services Module,” 2015. [Online]. Available: http://www.cisco.com/c/en/us/products/interfaces-modules/catalyst-6500-series-ssl-services-module/index.html

[159] “Firewalld - Fedora Project,” 2017. [Online]. Available: https://fedoraproject.org/wiki/Firewalld

[160] H. Balakrishnan, K. Lakshminarayanan, S. Ratnasamy, S. Shenker, I. Stoica, and M. Walfish, “A layered naming architecture for the internet,” in Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, ser. ACM SIGCOMM ’04, 2004, pp. 343–352. [Online]. Available: http://doi.acm.org/10.1145/1015467.1015505

[161] T. Koponen, M. Chawla, B.-G. Chun, A. Ermolinskiy, K. H. Kim, S. Shenker, and I. Stoica, “A data-oriented (and beyond) network architecture,” in Proceedings of the 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, ser. ACM SIGCOMM ’07. New York, NY, USA: ACM, 2007, pp. 181–192. [Online]. Available: http://doi.acm.org/10.1145/1282380.1282402

[162] I. P. Egwutuoha, D. Levy, B. Selic, and S. Chen, “A Survey of Fault Tolerance Mechanisms and Checkpoint/Restart Implementations for High Performance Computing Systems,” The Journal of Supercomputing, vol. 65, no. 3, pp. 1302–1326, 2013. [Online]. Available: http://dx.doi.org/10.1007/s11227-013-0884-0

[163] T. Hérault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing. Springer, 2015.

[164] F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir, “Toward Exascale Resilience: 2014 update,” Supercomputing Frontiers and Innovations, vol. 1, no. 1, 2014. [Online]. Available: http://superfri.org/superfri/article/view/14

[165] B. Schroeder and G. Gibson, “A Large-Scale Study of Failures in High-Performance Computing Systems,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, pp. 337–350, 2010.

[166] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov, “MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes,” in The International Conference for High Performance Computing, Networking, Storage and Analysis, 2002, pp. 29–29.

[167] S. Louca, N. Neophytou, A. Lachanas, and P. Evripidou, “MPI-FT: Portable Fault Tolerance Scheme for MPI,” Parallel Processing Letters, vol. 10, no. 04, pp. 371–382, 2000. [Online]. Available: http://www.worldscientific.com/doi/abs/10.1142/S0129626400000342

[168] W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, “Post-Failure Recovery of MPI Communication Capability,” The International Journal of High Performance Computing Applications, vol. 27, no. 3, pp. 244–254, 2013. [Online]. Available: http://dx.doi.org/10.1177/1094342013488238

[169] I. Laguna, D. F. Richards, T. Gamblin, M. Schulz, and B. R. de Supinski, “Evaluating User-Level Fault Tolerance for MPI Applications,” in Proceedings of the 21st European MPI Users’ Group Meeting, ser. EuroMPI/ASIA ’14, 2014, pp. 57:57–57:62. [Online]. Available: http://doi.acm.org/10.1145/2642769.2642775

[170] M. Gamell, D. S. Katz, H. Kolla, J. Chen, S. Klasky, and M. Parashar, “Exploring automatic, online failure recovery for scientific applications at extreme scales,” in ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’14, 2014, pp. 895–906. [Online]. Available: https://doi.org/10.1109/SC.2014.78

[171] I. P. Egwutuoha, S. Chen, D. Levy, and B. Selic, “A Fault Tolerance Framework for High Performance Computing in Cloud,” in 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, ser. CCGRID ’12, 2012, pp. 709–710. [Online]. Available: http://dx.doi.org/10.1109/CCGrid.2012.80

[172] S. Rao, L. Alvisi, and H. M. Vin, “Egida: an extensible toolkit for low-overhead fault-tolerance,” in 29th Annual International Symposium on Fault-Tolerant Computing, 1999, pp. 48–55.

[173] R. Batchu, J. P. Neelamegam, Z. Cui, M. Beddhu, A. Skjellum, Y. Dandass, and M. Apte, “MPI/FT: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing,” in IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001, pp. 26–33.

[174] W. Gropp and E. Lusk, “Dynamic Process Management in an MPI Setting,” in 7th IEEE Symposium on Parallel and Distributed Processing, 1995, pp. 530–533.

[175] G. E. Fagg, A. Bukovsky, and J. J. Dongarra, “HARNESS and Fault Tolerant MPI,” Parallel Computing, vol. 27, no. 11, pp. 1479–1495, 2001. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167819101001004

[176] G. E. Fagg and J. J. Dongarra, “Building and Using a Fault-Tolerant MPI Implementation,” The International Journal of High Performance Computing Applications, vol. 18, no. 3, pp. 353–361, 2004. [Online]. Available: http://dx.doi.org/10.1177/1094342004046052

[177] P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra, “Algorithm-based Fault Tolerance for Dense Matrix Factorizations,” in 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’12, 2012, pp. 225–234. [Online]. Available: http://doi.acm.org/10.1145/2145816.2145845

[178] A. Geist and C. Engelmann, “Development of Naturally Fault Tolerant Algorithms for Computing on 100,000 Processors,” 2002. [Online]. Available: http://www.csm.ornl.gov/~geist/Lyon2002-geist.pdf

[179] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, 2nd ed. MIT Press, 1999.

[180] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1992.

[181] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine et al., “Open MPI: Goals, concept, and design of a next generation MPI implementation,” in European Parallel Virtual Machine/Message Passing Interface Users’ Group Meeting. Springer, 2004, pp. 97–104.

[182] J. M. Squyres and A. Lumsdaine, “The Component Architecture of Open MPI: Enabling Third-Party Collective Algorithms,” pp. 167–185, 2005. [Online]. Available: http://dx.doi.org/10.1007/0-387-23352-0_11

[183] “Personal Communication with Mellanox Representative at Supercomputing,” 2016. [Online]. Available: http://sc16.supercomputing.org

[184] W. Gropp and E. Lusk, “Fault Tolerance in Message Passing Interface Programs,” The International Journal of High Performance Computing Applications, vol. 18, no. 3, pp. 363–372, 2004.