Motivation

- Locks, semaphores, and monitors work only under the shared-address-space model
  - Threads in the same process
  - Processes that share an address space
- We have assumed that processes/threads communicate via shared data (counter, producer-consumer buffer, …)

- How do we synchronize and communicate among processes with different address spaces?
  - Inter-process communication (IPC)

- Can we have a single set of primitives that works for all cases: a single machine with one OS, multiple machines with the same OS, multiple machines with different OSes, distributed systems?

With No Shared Address Space

- Processes cannot touch the same data directly
- Communicate by sending and receiving explicit messages: message passing
- Synchronization is implicit in message passing
  - No need for explicit mutual exclusion
  - Event ordering via the sending and receiving of messages
- More portable to different environments, though it lacks some of the convenience of a shared address space

Sending a Message

(Figure: P1 calls Send(), P2 calls Recv(); within a machine the message passes through the OS kernel, across a network it passes between the machines' OSes. Networking details are covered in COS 461.)

- P1 can send to P2, and P2 can send to P1
- Typically, communication is consummated via a send and a matching receive


Simple Send and Receive

    send( dest, data ), receive( src, data )

- Send "data" specifies where the data are in the sender's address space
- Recv "data" specifies where the incoming message data should be put in the receiver's address space

(Figure: sender S's buffer is copied into receiver R's buffer.)

Simple API

    send( dest, data ), receive( src, data )

- Destination or source
  - Direct address: node Id, process Id
  - Indirect address: mailbox, socket, channel, …
- Data
  - Buffer (address) and size
  - Anything else that specifies the source data or destination data structure

Simple Semantics

- Send call does not return until data have been copied out of the source data structure
  - Could be to the destination process/machine, or to an OS buffer, … (there are variants depending on this)
  - The source data structure can be safely overwritten without changing the data carried by the message
- Receive does not return until data have been received and copied into the destination data structure
- This is called Synchronous Message Passing
  - Makes synchronization implicit and easy
  - But processes wait around a lot, so it can hurt performance

Issues/options

- Asynchronous vs. synchronous
- Buffering of messages
- Matching of messages
- Direct vs. indirect communication/specification
- Data alone, or function invocation?
- How to handle exceptions (when bad things happen)?

Synchronous vs. Asynchronous Send

- Synchronous
  - Will not return until data are copied out of the source data structure
  - If a buffer is used for messaging and it is full, block

      send( dest, msg );

- Asynchronous
  - Return before data are copied out of the source data structure
  - Completion must be handled separately
    - Applications must check the returned status, or
    - The runtime notifies/interrupts the application
  - Block on a full buffer

      status = async_send( dest, msg );
      …
      if ( !send_complete(status) )
          wait for completion;
      … use msg data structure;

Synchronous vs. Asynchronous Receive

- Synchronous
  - Return data if there is a message
  - Block on an empty buffer

      recv( src, msg );

- Asynchronous
  - Return data if there is a message
  - Return status if there is no message (probe)

      status = async_recv( src, msg );
      if ( status == SUCCESS )
          consume msg;

      while ( probe(src) != HaveMSG )
          wait for msg arrival;
      recv( src, msg );
      consume msg;

(Figure on both slides: a msg transfer resource sits between sender and receiver.)

Buffering

- No buffering
  - Sender must wait until the receiver receives the message
  - Rendezvous on each message
- Finite buffer
  - Sender blocks when the buffer is full
- On distributed machines/OSes, there are buffers at one or both ends

Synchronous Send/Recv Within a System

Synchronous send:
- Call the send system call with M
- Send system call:
  - No buffer space in the kernel: block
  - Copy M to the kernel buffer

Synchronous recv:
- Call the recv system call
- Recv system call:
  - No M in the kernel buffer: block
  - Copy M to the user buffer

How to manage the kernel buffer?

(Figure: send(M) copies M into a kernel buffer; recv(M) copies it out.)

What if Buffers Fill Up?

- Make processes wait (can be hard to do when they are on different machines)

Direct Addressing Example

Producer(){
  ...
  while (1) {
    produce item;
    recv( Consumer, &credit );
    send( Consumer, item );
  }
}

Consumer(){
  ...
  for (i = 0; i < N; i++)
    send( Producer, credit );
  while (1) {
    recv( Producer, &item );
    send( Producer, credit );
    consume item;
  }
}

- The consumer grants N credits up front (one per buffer slot); the producer must receive a credit before each send, so the buffer cannot overflow

Indirect Addressing Example

Producer(){
  ...
  while (1) {
    produce item;
    recv( prodMbox, &credit );
    send( consMbox, item );
  }
}

Consumer(){
  ...
  for (i = 0; i < N; i++)
    send( prodMbox, credit );
  while (1) {
    recv( consMbox, &item );
    send( prodMbox, credit );
    consume item;
  }
}

Indirect Communication

- Names
  - mailbox, socket, channel, …
- Would it work with multiple producers and 1 consumer?
- Would it work with 1 producer and multiple consumers?
- What about multiple producers and multiple consumers?


Mailbox Message Passing

- Message-oriented 1-way communication
- Data structure
  - Mutex, condition variable, buffer for messages
- Operations
  - Init, open, close, send, receive, …
- Does the sender know when the receiver gets a message?

Example: Keyboard Input

- Interrupt handler
  - Get the input characters and hand them to the device thread
- Device thread
  - Generate a message and send it to the mailbox of an input process

(Figure: the interrupt handler signals V(s); the device thread loops: P(s), acquire mutex, getchar(), convert, release mutex, mbox_send(M); the input process calls mbox_recv(M) on the mbox.)
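The slide's data structure maps directly onto a mutex, a condition variable, and a bounded buffer. Below is a minimal sketch in C with POSIX threads; the names mbox_t, mbox_send, and mbox_recv follow the slide, while the capacity MBOX_CAP and the int message type are assumptions for illustration.

#include <pthread.h>

#define MBOX_CAP 16                          /* assumed capacity */

typedef struct {
    pthread_mutex_t m;                        /* mutex */
    pthread_cond_t  nonempty, nonfull;        /* condition variables */
    int             buf[MBOX_CAP];            /* buffer for messages */
    int             head, tail, count;
} mbox_t;

void mbox_init(mbox_t *mb) {
    pthread_mutex_init(&mb->m, NULL);
    pthread_cond_init(&mb->nonempty, NULL);
    pthread_cond_init(&mb->nonfull, NULL);
    mb->head = mb->tail = mb->count = 0;
}

void mbox_send(mbox_t *mb, int msg) {         /* blocks while the buffer is full */
    pthread_mutex_lock(&mb->m);
    while (mb->count == MBOX_CAP)
        pthread_cond_wait(&mb->nonfull, &mb->m);
    mb->buf[mb->tail] = msg;
    mb->tail = (mb->tail + 1) % MBOX_CAP;
    mb->count++;
    pthread_cond_signal(&mb->nonempty);
    pthread_mutex_unlock(&mb->m);
}

int mbox_recv(mbox_t *mb) {                   /* blocks while the buffer is empty */
    pthread_mutex_lock(&mb->m);
    while (mb->count == 0)
        pthread_cond_wait(&mb->nonempty, &mb->m);
    int msg = mb->buf[mb->head];
    mb->head = (mb->head + 1) % MBOX_CAP;
    mb->count--;
    pthread_cond_signal(&mb->nonfull);
    pthread_mutex_unlock(&mb->m);
    return msg;
}

This design also answers the slide's question: the sender does not learn when the receiver gets the message, since mbox_send returns as soon as the message is in the buffer.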


Sockets

- Sockets
  - Bidirectional (unlike a mailbox)
  - Unix domain sockets (IPC on the same machine)
  - Network sockets (communication over a network)
  - Same API for both
- Two types
  - Datagram socket (UDP)
    - Collection of messages
    - Best effort
    - Connectionless
  - Stream socket (TCP)
    - Stream of bytes (like a pipe)
    - Reliable
    - Connection-oriented

Address Binding

- A network socket binds to
  - Host: IP address (e.g. 128.112.9.1)
  - Protocol: UDP/TCP
  - Port: port number
- Well-known ports (0..1023), e.g. port 80 for the Web
- Unused ports available for clients (1025..65535)
- Why ports?
  - Indirection: no need to know which process to communicate with
  - Updating one side won't affect the other side

(Figure: processes call send/recv on sockets; the kernel's UDP/TCP protocol stack delivers to the bound IP address and port.)


Communication with Stream Sockets

Server: create a socket → bind to a port → listen on the port → accept a connection → receive a request → send a response
Client: create a socket → connect to the server → send a request → receive a response

(Figure: the client's connect matches the server's accept to establish the connection; the request and reply then flow over it.)

Sockets API

- Create and close a socket
  - sockid = socket(af, type, protocol);
  - sockerr = close(sockid);
- Bind a socket to a local address
  - sockerr = bind(sockid, localaddr, addrlength);
- Negotiate the connection
  - listen(sockid, length);
  - accept(sockid, addr, length);
- Connect a socket to a destination
  - connect(sockid, destaddr, addrlength);
- Message passing
  - send(sockid, buf, size, flags);
  - recv(sockid, buf, size, flags);
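To make the client-side call sequence concrete, here is a minimal sketch in C; the server address 127.0.0.1 and port 8080 are assumptions for illustration, not fixed by the slides.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);        /* create a TCP socket */
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);                     /* assumed server port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr); /* assumed server host */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect"); return 1;                 /* connect to the server */
    }

    const char *req = "hello";
    send(fd, req, strlen(req), 0);                   /* send request */

    char buf[128];
    ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);   /* receive response */
    if (n > 0) { buf[n] = '\0'; printf("reply: %s\n", buf); }

    close(fd);
    return 0;
}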

Unix Pipes

- An output stream connected to an input stream by a chunk of memory (a queue of bytes)
- Send (called write) is non-blocking
- Receive (called read) is blocking
- Buffering is provided by the OS

What if things go bad?

- R waits for a message from S, but S has terminated
  - R may be blocked forever
- S sends a message to R, but R has terminated
  - S has no buffer and will be blocked forever
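A minimal sketch of a pipe between a parent (writer) and a child (reader); the message text is arbitrary.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fds[2];
    if (pipe(fds) < 0) { perror("pipe"); return 1; }

    if (fork() == 0) {                    /* child: the reader */
        close(fds[1]);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);  /* blocks until data */
        if (n > 0) { buf[n] = '\0'; printf("child got: %s\n", buf); }
        close(fds[0]);
        return 0;
    }

    close(fds[0]);                        /* parent: the writer */
    const char *msg = "hello";
    write(fds[1], msg, strlen(msg));      /* buffered by the OS */
    close(fds[1]);
    wait(NULL);
    return 0;
}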


Exception: Message Loss

- Use an ack and a timeout to detect and retransmit a lost message
  - Receiver sends an ack for each message
  - Sender blocks until the ack comes back, or a timeout occurs:

      status = send( dest, msg, timeout );

  - If the timeout expires with no ack, retransmit the message
- Issues
  - Duplicates
  - Losing ack messages

(Figure: S sends a message to R; R returns an ack.)

Exception: Message Loss, cont'd

- Retransmission must handle
  - Duplicate messages on the receiver side
  - Out-of-sequence ack messages on the sender side
- Retransmission
  - Use a sequence number for each message to identify duplicates
  - Remove duplicates on the receiver side
  - Sender retransmits on an out-of-sequence ack
- Reduce ack messages
  - Bundle ack messages
  - Piggy-back acks on send messages

(Figure: S sends send1 and send2; R returns ack1 and ack2.)
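A sketch of the sender's timeout-and-retransmit loop with sequence numbers. The primitives net_send and net_recv_ack, the retry limit, and the timeout value are hypothetical placeholders, not part of any specific API.

#include <stdbool.h>

#define MAX_RETRIES 5                     /* assumed retry limit */

/* Hypothetical unreliable transport primitives. */
bool net_send(int dest, const void *msg, int len, int seq);
bool net_recv_ack(int dest, int *acked_seq, int timeout_ms);

/* Stop-and-wait: send msg with sequence number seq, retransmit on timeout. */
bool reliable_send(int dest, const void *msg, int len, int seq) {
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        net_send(dest, msg, len, seq);
        int acked;
        if (net_recv_ack(dest, &acked, /*timeout_ms=*/100) && acked == seq)
            return true;                  /* matching ack: done */
        /* timeout or out-of-sequence ack: fall through and retransmit */
    }
    return false;                         /* give up; report the exception */
}

On the other side, the receiver uses seq to recognize and discard the duplicates that retransmission creates.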


Exception: Message Corruption

- Detection
  - Compute a checksum over the entire message and send the checksum (e.g. a CRC code) as part of the message
  - Recompute the checksum on receive and compare it with the checksum in the message
- Correction
  - Trigger retransmission
  - Use error-correcting codes to recover

(Figure: message layout "Data | CRC"; the sender computes the checksum over Data.)

Message Passing Interface (MPI)

- A message-passing standard for parallel machines
  - Implemented at user level for high-performance computing
  - Portable
- Basic (6 functions)
  - Works for most parallel programs
- Large (125 functions)
  - Blocking (or synchronous) message passing
  - Non-blocking (or asynchronous) message passing
  - Collective communication
- References
  - http://www.mpi-forum.org/
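A minimal sketch of the detection scheme. A simple additive checksum stands in for a real CRC (an assumption for brevity; production code would use CRC-32 or similar):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Simple additive checksum; a stand-in for a real CRC. */
uint32_t checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* Sender: compute the trailer that is sent along with the data. */
uint32_t make_trailer(const uint8_t *msg, size_t len) {
    return checksum(msg, len);
}

/* Receiver: recompute and compare; a mismatch triggers retransmission. */
bool message_ok(const uint8_t *msg, size_t len, uint32_t trailer) {
    return checksum(msg, len) == trailer;
}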


Remote Procedure Call (RPC)

- Make remote procedure calls
  - Similar to local procedure calls
  - Examples: SunRPC, Java RMI
- Restrictions
  - Call by value
  - Call by object reference (maintain consistency)
  - Not call by reference
- Different from mailbox, socket, or MPI
  - Remote execution, not just data transfer
- References
  - B. J. Nelson, Remote Procedure Call, PhD Dissertation, 1981
  - A. D. Birrell and B. J. Nelson, Implementing Remote Procedure Calls, ACM Trans. on Computer Systems, 1984

RPC Model

(Figure: the caller issues an RPC call; a request message including the arguments goes to the server; the server executes the function with the passed arguments; a reply message including a return value comes back; the caller returns, same as a local call.)

- Compile-time type checking and interface generation
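A sketch of what happens under an RPC call: marshal the call into a request message, send it, and unmarshal the reply. The wire layout (client id, RPC id, one 32-bit argument) and the transport functions are assumptions for illustration, not any real RPC system's format.

#include <stdint.h>
#include <string.h>

/* Hypothetical reliable transport. */
void transport_send(const uint8_t *buf, size_t len);
size_t transport_recv(uint8_t *buf, size_t cap);

/* Assumed wire format: [client_id][rpc_id][arg], each 32 bits. */
int32_t rpc_call(uint32_t client_id, uint32_t rpc_id, int32_t arg) {
    uint8_t req[12], rep[4];
    memcpy(req + 0, &client_id, 4);    /* marshal ClientId */
    memcpy(req + 4, &rpc_id, 4);       /* marshal RPCId */
    memcpy(req + 8, &arg, 4);          /* marshal the argument (by value) */
    transport_send(req, sizeof(req));  /* request message */

    transport_recv(rep, sizeof(rep));  /* reply message */
    int32_t result;
    memcpy(&result, rep, 4);           /* unmarshal the result */
    return result;                     /* return, same as a local call */
}

In a real system this code is not written by hand: the stub that does the marshalling is generated from an interface description, which is what enables the compile-time type checking the slide mentions.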


RPC Mechanism

(Figure: the client program makes a call into the client stub, which encodes/marshalls the arguments; the RPC runtime sends the request to the server's RPC runtime; the server stub decodes/unmarshalls and calls the server program; the reply retraces the path. Request message: ClientId, RPCId, Call, Args. Reply message: Reply, Results.)

Summary

- Message passing
  - Move data between processes
  - Implicit synchronization
  - Many API design alternatives (sockets, MPI)
  - Indirection is helpful
- Implementation and semantics
  - The synchronous method is most common
  - The asynchronous method provides overlapping, but requires careful design and implementation decisions
  - Indirection makes implementation flexible
  - Exceptions need to be carefully handled
- RPC
  - Remote execution like local procedure calls
  - With constraints in terms of passing data

Appendix: Message Passing Interface (MPI)

Hello World using MPI

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, size;
    MPI_Init( &argc, &argv );                /* initialize the MPI environment */
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );  /* return my rank */
    MPI_Comm_size( MPI_COMM_WORLD, &size );  /* return the # of processes */
    printf( "I am %d of %d\n", rank, size );
    MPI_Finalize();                          /* last call, to clean up */
    return 0;
}


Blocking Send

- MPI_Send(buf, count, datatype, dest, tag, comm)
  - buf: address of send buffer
  - count: # of elements in buffer
  - datatype: data type of each send buffer element
  - dest: rank of destination
  - tag: message tag
  - comm: communicator
- This routine may block until the message is received by the destination process
  - Depending on the implementation
  - But it will block until the user source buffer is reusable
- More about message tags later

Blocking Receive

- MPI_Recv(buf, count, datatype, source, tag, comm, status)
  - buf: address of receive buffer (output)
  - count: maximum # of elements in receive buffer
  - datatype: datatype of each receive buffer element
  - source: rank of source
  - tag: message tag
  - comm: communicator
  - status: status object (output)
- Receive a message with the specified tag from the specified comm and the specified source process
- MPI_Get_count(status, datatype, count) returns the real count of the received data
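A minimal sketch putting the two calls together: rank 0 sends one integer to rank 1. The tag value 7 is an arbitrary choice; it only has to match on both sides.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* blocks until `value` is safe to reuse */
        MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocks until a matching message arrives */
        MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}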


More on Send & Recv

- Can send from the source to the destination directly
- The send and receive must match on
  - Source rank (can be MPI_ANY_SOURCE)
  - Tag (can be MPI_ANY_TAG)
  - Comm (can be MPI_COMM_WORLD)

(Figure: MPI_Send(…, dest=1, tag=1, comm=X…) matches MPI_Recv(…, source=0, tag=1, comm=X…); messages with other tags wait for their own matching receives.)

Buffered Send

- MPI_Bsend(buf, count, datatype, dest, tag, comm)
  - buf: address of send buffer
  - count: # of elements in buffer
  - datatype: type of each send element
  - dest: rank of destination
  - tag: message tag
  - comm: communicator
- May buffer; the user can reuse the send buffer right away
- MPI_Buffer_attach() and MPI_Buffer_detach() create and destroy the buffer
- MPI_Ssend: returns only when a matching receive has been posted; no buffer needed
- MPI_Rsend: assumes the receive has been posted already (programmer's responsibility)
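A sketch of the attach/Bsend/detach sequence; the 1 KB buffer size is an arbitrary assumption, and the code assumes it runs on rank 0 of a job with at least two processes.

#include "mpi.h"
#include <stdlib.h>

void bsend_example(void) {
    int size = 1024 + MPI_BSEND_OVERHEAD;   /* assumed payload + MPI overhead */
    char *buffer = malloc(size);
    MPI_Buffer_attach(buffer, size);        /* give MPI a user-level buffer */

    int value = 42;
    MPI_Bsend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    /* `value` may be reused immediately; the message sits in the buffer */

    MPI_Buffer_detach(&buffer, &size);      /* blocks until buffered sends drain */
    free(buffer);
}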

Non-Blocking Send

- MPI_Isend(buf, count, datatype, dest, tag, comm, *request)
  - request is a handle, used by the other calls below
- Returns as soon as possible
  - Unsafe to use buf right away
- MPI_Wait(*request, *status)
  - Block until the send is done
- MPI_Test(*request, *flag, *status)
  - Return the status without blocking

Usage patterns:

  MPI_Isend(…)
  work to do
  MPI_Wait(…)

  MPI_Isend(…)
  while ( flag == FALSE ) {
      more work
      MPI_Test( …, &flag, … );
  }

Non-Blocking Recv

- MPI_Irecv(buf, count, datatype, source, tag, comm, *request)
- Returns right away
- MPI_Wait()
  - Block until the receive finishes
- MPI_Test()
  - Return the status without blocking
- MPI_Iprobe(source, tag, comm, *flag, *status)
  - Is there a matching message?

Usage patterns:

  MPI_Irecv(…)
  work to do
  MPI_Wait(…)

  while ( flag == FALSE ) {
      more work
      MPI_Iprobe( …, &flag, … );
  }
  MPI_Irecv(…) or MPI_Recv(…)
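A sketch of the point of these calls: overlapping communication with computation. The do_other_work function is a placeholder for whatever computation does not depend on the transfer.

#include "mpi.h"

void do_other_work(void);                 /* placeholder computation */

/* Rank 0 sends, rank 1 receives; both overlap work with the transfer. */
void overlap_example(int rank, double *buf, int n) {
    MPI_Request req;
    MPI_Status status;

    if (rank == 0)
        MPI_Isend(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else
        MPI_Irecv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    do_other_work();                      /* compute while the data moves */

    MPI_Wait(&req, &status);              /* now buf is safe to reuse or read */
}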

