msocket: Multiple Stack Support for the Berkeley Socket API

Renzo Davoli, Computer Science Department, University of Bologna, Italy
Michael Goldweber, Dept. of Mathematics and Computer Science, Xavier University, USA

ABSTRACT

The de facto standard for network programming, the Berkeley socket API, supports several protocol families. Unfortunately, it has a significant limitation: it allows only a single implementation for each supported protocol family. Hence, using Berkeley sockets, it is impossible to access multiple distinct networking stacks for the same protocol, e.g. multiple TCP/IP stacks. This paper defines msocket, an extension to the Berkeley socket API which overcomes this limitation. msocket has been implemented as a feature of the View-OS project. Finally, we illustrate the utility and effectiveness of our extended API by providing some examples of its use.

Categories and Subject Descriptors

D.4.4 [Operating Systems]: Communication Management - Network communication; C.2.2 [Computer-Communication Networks]: Network Protocols - Applications

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'12 March 25-29, 2012, Riva del Garda, Italy.
Copyright 2011 ACM 978-1-4503-0857-1/12/03 ...$10.00.

1. INTRODUCTION

The Berkeley socket API is the de facto standard for network programming. A fundamental concept of the Berkeley API is that network communication endpoints are represented in the API by file descriptors. Virtually all of the API's functions use file descriptors to identify individual communication endpoints. The API function that defines a new endpoint descriptor is the socket function.

The syntax of the Berkeley socket call is:

    int socket(int domain, int type, int protocol);

where domain specifies the communication domain, i.e. the protocol family to use (e.g. PF_INET for IPv4, PF_INET6 for IPv6, or PF_IRDA for IrDA), and type indicates the communication semantics (e.g. stream or datagram). The final argument, protocol, which is protocol family specific, specifies the protocol to use when the same semantics can be provided by different protocols.

In addition to socket, there is also socketpair, which is usually defined only for the PF_UNIX protocol family. Our discussion of socketpair can be found in Section 3.

The original API was designed to support a wide range of domains, services and protocols. However, it supports at most one implementation for each domain/type/protocol assignment.

For example:

    fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

defines a socket for a TCP stream connection. Observe that if there are multiple distinct TCP/IP stacks available, there is no way to specify which stack to use for the communication stream.

The core idea of this proposal is to augment the socket API with the following additional function:

    int msocket(char *stack,
                int domain, int type, int protocol);

msocket, in addition to the socket parameters, allows one to specify, via a new first argument, which stack to use.

The remainder of the paper is organized as follows: Section 2 provides some examples of applications motivating the need for having/accessing multiple networking stacks. Section 3 details the msocket API definition and its binary compatibility with existing applications. In Section 4 we present the current proof-of-concept implementation of msocket in View-OS. Finally, in Sections 5 and 6 we discuss related work, our conclusions and future directions.

2. APPLICATION DOMAINS FOR THE MSOCKET API

The following non-exhaustive list describes various domains of application where the availability of multiple networking stacks either outright permits or simply eases the implementation of useful networking services.

Experimental networking stacks running on remote machines: The use of one stack both as the target and as the control channel perturbs the results and can partition the remote machine whenever the experimental stack malfunctions.

Support of different network requirements: For example, a stack used for LAN-based communication may not require overly large buffers to support TCP's sliding window protocol, while TCP streams on satellite channels require large sliding windows.

Being able to support several stacks means that individual stacks can be algorithmically parameterized for specific purposes.

Define permissions on stacks: Different users may be granted differing networking services, e.g. differing QoS levels, access to various IP addresses, or the routing of network traffic along different network paths. Currently, one must define specific filters (e.g. iptables on GNU/Linux) to define differing network services for users. Using the msocket API, there can be several available stacks, each with its own interfaces, IP addresses, routing services, and access permissions. Since network stacks are UNIX-special files defined in the standard file system, access permissions are defined and controlled in the usual manner (i.e. chmod or ACLs). Hence the msocket API, using the security model inherited from the file system, allows a system administrator to control which users/groups have access to which stacks.

Network sandboxing: This is a special case of the previous example. If a user lacks permission to access/use any networking stack, she has no way to generate any network traffic.

Define different domains of protection/levels of security: A user may need to use differing network accesses simultaneously. For example, a user may need to concurrently use a VPN and her local network. She might use the VPN to access sensitive company data and the local network to read some domotics parameters of her room/home. Using a single stack this is only possible by defining filtering rules (strictly an administrative procedure and not meant for user configuration). Alternatively, if there are two stacks, one can be dedicated for use by the VPN and the other for the local network. The setup can be isomorphic to the straightforward GUI print dialog for selecting which printer to use to satisfy a word processor print directive. Similarly, a user may want to define several networking tunnels and compare how a geolocated web service provides different answers depending on the location of the client.

Transparent use of compatible implementations: New protocols may provide compatible services sharing the same addressing scheme as existing ones. An example of this is the Sockets Direct Protocol (SDP) [7], which provides the same service as TCP. Similarly, applications can use Reliable Datagram Sockets (RDS) instead of UDP. Via the msocket API, existing applications can choose other compatible services instead of the "standard" ones by specifying the appropriate networking stack from the set of available stacks.

Implementation of an "Internet of Threads" [4]: In the original design of IP, the addressable nodes were the networking adapters. While this model is still dominant, nodes today are also virtual interfaces, virtual machines, or one out of many addresses assigned to an adapter supporting the splitting of services (e.g. to migrate the services between nodes of a high availability cluster). In an Internet of Threads (IoTh), processes, threads, or sets of processes can now be addressable Internet "nodes." Any implementation of an IoTh requires support for, and an API for the use of, multiple stacks: msocket.

3. MSOCKET DEFINITION

msocket has the following syntax:

    int msocket(char *stack,
                int domain, int type, int protocol);

In UNIX systems, stack is a UNIX-special file name (i.e. a pathname). In non-UNIX systems, the same interface could be used to identify a kernel object. Since the Berkeley socket API is currently adopted by non-UNIX based operating systems, our proposed msocket extensions should be easily portable to other environments.

Backward compatibility is provided by the definition of default stack(s). Each process has a default stack defined for each protocol family. If the stack argument is NULL, the socket is defined using the default stack for the protocol family of the domain argument.

We redefine the socket system call in terms of msocket as follows:

    int socket(int domain, int type, int protocol) {
        return msocket(NULL, domain, type, protocol);
    }

A file descriptor created via a call to socket will use the default stack defined for the address family specified. Default stack definitions are inherited across process creation (fork(2)) and program execution (execve(2)).

The definition of msocket should appear natural to UNIX programmers as it extends socket via a leading pathname argument.

The choice to add msocket as a new system call means the syntax of socket is unmodified, thus ensuring binary compatibility for existing applications. An alternative approach would redefine socket to have a variable number of arguments. Not only are variable parameter system calls rare (since they lead to code with a lower degree of readability), but whenever a system call requires a pathname as an argument, it is virtually always the first argument. We believe that a well designed API should be informed by the most common cases when deciding on parameter order so that the usage of the API can appear both natural and familiar to programmers.

While it is natural to define a networking stack as a UNIX-special file, stacks unfortunately cannot be classified using any of the existing categories of UNIX-special files (block, character, fifo, etc.). We therefore propose a new UNIX-special file type for stack files, encoded in st_mode as defined in stat(2):

    #define S_IFSTACK 0160000

Furthermore, our proposed stack UNIX-special files can only be used by msocket (e.g. not by open(2)).

There are two main reasons to use msocket instead of open. The first is a technical reason: a stack typically provides several different protocols/services. For example, a TCP/IP stack provides not only the datagram service and the stream service, but also netlink (PF_NETLINK) services for configuration and possibly direct access to the underlying network (PF_PACKET). The use of open would necessitate the definition of several special files (one per family/protocol), or the definition of tags to support service configuration.

This is related to the second reason for using msocket instead of open. The Berkeley socket interface is widely used and accepted by the community of network programmers. The interface includes some specific input/output functions such as send and recv. Furthermore, in some contexts, it is also possible to use standard system calls like read or write. For example, a file descriptor returned by socket can be used in a recv call (or in a read call), while a file descriptor returned by open cannot be used with recv. Our msocket proposal solves the need to access several different stacks while minimizing changes to the API, allowing current programming practices to continue unchanged: the advantage of maintaining backward compatibility.

The final remaining issue with our msocket definition is the assignment of default stacks for processes. For this, we propose to overload our msocket system call. When type is SOCK_DEFAULT, msocket does not define a communication endpoint, but instead defines the default stack for the calling process for the specified family (the domain argument) or for all protocol families if domain is PF_UNSPEC.

If type is SOCK_DEFAULT:

• the path of the UNIX-special file (stack) must be non-NULL,

• msocket returns 0 on success or -1 in case of error,

• errno is ENOENT if the specified UNIX-special file does not exist, and EACCES if the requested access to the stack file is not allowed or permission is denied for one of the directories in the pathname. In this same vein, ELOOP, ENAMETOOLONG and ENODEV are defined as in open(2). EINVAL indicates that the path does not refer to a UNIX-special stack file.

Sockets can also be defined by socketpair. The msocket API also includes a multi-stack definition for this system call:

    int msocketpair(char *stack,
                    int domain, int type, int protocol, int sv[2]);

As before, stack is the UNIX-special stack filename or NULL, where NULL indicates the use of the default stack for domain.

Figure 2 in the Appendix shows an example use of msocket: the implementation of a TCP forwarding tool that can connect a client and a server reachable by different stacks. The code, while complete, omits any error control for purposes of clarity in the presentation.

4. MSOCKET IMPLEMENTATION IN VIEW-OS

The partial virtual machines defined by the View-OS [6] project support the modular virtualization of many different system components including file systems, devices, and networking.

There are currently two implementations of View-OS partial virtual machines: umview, which runs on standard Linux kernels, and kmview, which needs the utrace [8] support as [...]

    $ umview bash
    $ kmview bash
    $ umview xterm
    $ kmview xterm

    [1]$ um_add_service umnet
    umnet init
    [2]$ mount -t umnetlwipv6 none /dev/net/s1
    [3]$ mount -t umnetlwipv6 -o tp0=tapx none /dev/net/s2
    [4]$ mstack /dev/net/s1 ip addr
    1: lo0: mtu 0
        link/loopback
        inet6 ::1/128 scope host
        inet 127.0.0.1/8 scope host
    2: vd0: mtu 1500
        link/ether 02:02:63:d9:4b:06 brd ff:ff:ff:ff:ff:ff
        inet6 fe80::2:63ff:fed9:4b06/64 scope link
    [5]$ mstack /dev/net/s2 ip addr
    1: lo0: mtu 0
        link/loopback
        inet6 ::1/128 scope host
        inet 127.0.0.1/8 scope host
    2: tp0: mtu 1500
        link/ether 02:02:03:04:05:06 peer ff:ff:ff:ff:ff:ff

Figure 1: View-OS support of "msockets"

Fig. 1 shows an example of a View-OS session where a user has two networking stacks available simultaneously. Command [1] loads the network virtualization umnet service. View-OS uses the mount system call to define a network stack (commands [2] and [3]). umnetlwipv6 is a network stack based on the View-OS lwipv6 library, an implementation of an IPv4/IPv6 hybrid stack as a library.

The syntax and semantics of the mount command in this example are part of the View-OS project. While the View-OS mount command is not part of our msocket API proposal, it is described here simply to explain the example and to provide the reader with a complete replicable experiment.

View-OS also provides a user-level mstack command that redefines the default stack and then invokes exec using a specified command. mstack is View-OS's user-mode interface to the SOCK_DEFAULT feature of the proposed msocket system call.

Commands [4] and [5] in Fig. 1 show the output of ip addr evaluated on the two different networking stacks. The ip command uses the Berkeley socket API to configure the network, using the AF_NETLINK protocol family. The ip addr example simply shows that a user can run standard commands (e.g. ip) on different stacks. Furthermore, in View-OS, interfaces can be configured and enabled by standard commands. For example, the following commands configure the /dev/net/s1 stack of Fig. 1:

    $ mstack /dev/net/s1 ip addr add 1.2.3.4/24 dev vd0
    $ mstack /dev/net/s1 ip link set vd0 up

The other stack can be configured in a similar way.

Standard networking clients can use either of the available stacks:

    $ mstack /dev/net/s1 ssh remote.machine.org
    $ mstack /dev/net/s2 Mail -s "mail through s2"
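The behaviour attributed to mstack above (redefine the default stack, then exec the given command) can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the View-OS source: the numeric value of SOCK_DEFAULT is not given in the text and is chosen arbitrarily here, the helper name run_on_stack is hypothetical, and msocket is replaced by a stand-in that always fails with ENOSYS so that the fragment compiles on systems without the proposed call.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>   /* PF_UNSPEC */
#include <unistd.h>       /* execvp */

/* ASSUMPTION: the text names SOCK_DEFAULT but gives no value for it. */
#ifndef SOCK_DEFAULT
#define SOCK_DEFAULT 0x7fff
#endif

/* Stand-in for the proposed system call, used only so this sketch
 * builds anywhere. It always fails with ENOSYS; under View-OS the
 * real msocket declared in msocket.h would be called instead. */
static int msocket(const char *stack, int domain, int type, int protocol)
{
    (void)stack; (void)domain; (void)type; (void)protocol;
    errno = ENOSYS;
    return -1;
}

/* Hypothetical helper: make `stack` the default for every protocol
 * family of the calling process, then replace the process image with
 * the command in argv. Returns -1 (errno set) if either step fails;
 * it does not return on success. */
int run_on_stack(const char *stack, char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);
    if (msocket(stack, PF_UNSPEC, SOCK_DEFAULT, 0) < 0)
        return -1;
    execvp(argv[0], argv);   /* only returns on failure */
    return -1;
}
```

Because default stacks are inherited through execve(2), the command started this way would use the selected stack for every plain socket call without being modified or recompiled.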

6. CONCLUSIONS AND FUTURE DEVELOPMENTS

We have introduced an extension of the Berkeley socket API for the support of multiple stacks. The core ideas of this proposal are:

• the msocket system call,

• the naming of the network stack via a UNIX-special file (mapped on the file system), and

• the backward compatibility for all applications using socket given by the concept of default stacks.
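The third idea, default stacks, can be modelled in user space as a per-family table. The sketch below is a toy model written for illustration and is not part of the proposal: the names model_set_default and model_resolve and the bound MODEL_MAX_FAMILY are assumptions, and real default stacks would be per-process kernel state rather than a static array.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>   /* PF_UNSPEC, PF_INET, PF_INET6 */

/* ASSUMPTION: an upper bound on protocol family numbers, chosen only
 * to size the table in this model. */
#define MODEL_MAX_FAMILY 64

/* One default stack pathname per protocol family (NULL = none set).
 * The table stores the caller's pointer, so the string must outlive it. */
static const char *default_stack[MODEL_MAX_FAMILY];

/* Models msocket(stack, domain, SOCK_DEFAULT, 0): a non-NULL stack
 * becomes the default for one family, or for all families when
 * domain is PF_UNSPEC. Returns 0 on success, -1 on bad arguments. */
int model_set_default(const char *stack, int domain)
{
    if (stack == NULL || domain < 0 || domain >= MODEL_MAX_FAMILY)
        return -1;
    if (domain == PF_UNSPEC) {
        for (int f = 0; f < MODEL_MAX_FAMILY; f++)
            default_stack[f] = stack;
    } else {
        default_stack[domain] = stack;
    }
    return 0;
}

/* Models stack selection in msocket(stack, domain, ...): an explicit
 * stack wins, and NULL falls back to the family default. This is the
 * path taken by socket(), defined as msocket(NULL, ...). */
const char *model_resolve(const char *stack, int domain)
{
    assert(domain >= 0 && domain < MODEL_MAX_FAMILY);
    return stack != NULL ? stack : default_stack[domain];
}
```

In this model an unmodified application corresponds to calls of model_resolve(NULL, domain), which is why existing binaries keep working while the stack they reach can be changed from outside.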

We have provided a proof-of-concept implementation using View-OS partial virtual machines to show both the effectiveness of our approach and to give an idea of the wide range of applications for msocket.

Further investigation is needed to ensure that the msocket API is the most effective we can make it. In particular, additional investigation is needed with regard to our overloading the msocket call to define process specific default networking stacks. Perhaps an additional system call is more appropriate. Finally, other user-space utilities can be investigated/designed in conjunction with msocket, including GUI dialogues for graphical programs to select the stack or stacks to work on.

7. REFERENCES

[1] Cisco OpenFabrics Enterprise Distribution InfiniBand Host Drivers User Guide for Linux. Technical report, Cisco OL-10778-01, 2006.
[2] E. W. Biederman. ns: Introduce the setns syscall. https://lkml.org/lkml/2011/5/6/411, 2011.
[3] D. Hildebrand. An architectural overview of QNX. In Proceedings of the Workshop on Micro-kernels and Other Kernel Architectures, pages 113-126, 1992.
[4] R. Davoli. Internet of threads. Communication at the Conferenza GARR 2011 (in Italian), http://www.garr.it/a/conf11/, 2011.
[5] P. Emelyanov. net: Implement socketat. http://lwn.net/Articles/407615/, 2010.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <poll.h>
    #include <netdb.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include "msocket.h"

    /* This is a TCP packet forwarding program.
       usage: prog stack1 address1 port1 stack2 address2 port2
       e.g.   prog /dev/net/1 192.168.100.1 1111 /dev/net/2 192.168.102.2 2222
       All TCP connections for 192.168.100.1 port 1111 on stack /dev/net/1
       get forwarded to 192.168.102.2 port 2222 using stack /dev/net/2.

       Error control has been omitted (... comments) to avoid
       details out of the scope of this explanation. */
    int main(int argc, char *argv[])
    {
        int sockin, sockinc;
        int rv;
        struct addrinfo *addr1, *addr2;
        /* argc/argv consistency check ... */
        /* getaddrinfo returns a nonzero code on failure */
        if (getaddrinfo(argv[2], argv[3], NULL, &addr1) != 0 ||
            getaddrinfo(argv[5], argv[6], NULL, &addr2) != 0)
            exit(EXIT_FAILURE);
        /* listening socket on the first stack */
        sockin = msocket(argv[1], addr1->ai_family, SOCK_STREAM, IPPROTO_TCP);
        /* ... */
        rv = bind(sockin, addr1->ai_addr, addr1->ai_addrlen);
        /* ... */
        rv = listen(sockin, 5);
        /* ... */
        for (;;) {
            int sockout;
            sockinc = accept(sockin, NULL, NULL);
            /* ... */
            /* outgoing socket on the second stack */
            sockout = msocket(argv[4], addr2->ai_family, SOCK_STREAM, IPPROTO_TCP);
            /* ... */
            if (connect(sockout, addr2->ai_addr, addr2->ai_addrlen) >= 0) {
                char buf[BUFSIZ];
                struct pollfd pfd[] = {{sockinc, POLLIN, 0}, {sockout, POLLIN, 0}};
                /* relay data in both directions until either side closes */
                for (;;) {
                    int n;
                    poll(pfd, 2, -1);
                    /* ... */
                    if (pfd[0].revents & POLLIN) {
                        if ((n = read(sockinc, buf, BUFSIZ)) <= 0)
                            break;
                        write(sockout, buf, n);
                    }
                    if (pfd[1].revents & POLLIN) {
                        if ((n = read(sockout, buf, BUFSIZ)) <= 0)
                            break;
                        write(sockinc, buf, n);
                    }
                }
                close(sockout);
            } /* else ... */
            close(sockinc);
        }
    }

Figure 2: An example of msocket usage: A TCP forwarder across different stacks
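One piece of the error control omitted from Figure 2 concerns write, which may transfer fewer bytes than requested or be interrupted by a signal. A possible helper for that case is sketched below. The name write_all is hypothetical and the function is not part of the msocket proposal. It relies only on standard POSIX calls and works on any file descriptor, including those returned by msocket.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write until all `len` bytes of
 * `buf` have been written to `fd`. Retries when interrupted by a
 * signal (EINTR). Returns len on success, or -1 with errno set on
 * any other error. */
ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    assert(buf != NULL || len == 0);
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted: try again */
            return -1;           /* real error: report it */
        }
        done += (size_t)n;       /* short write: send the rest */
    }
    return (ssize_t)done;
}
```

In the forwarder, each write(sockout, buf, n) could be replaced by write_all(sockout, buf, n), breaking out of the relay loop when it returns -1.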
