COS 318: Operating Systems
Message Passing
(http://www.cs.princeton.edu/courses/cos318/)

Motivation
- Locks, semaphores, and monitors are good, but they only work under the shared-address-space model:
  - threads in the same process
  - processes that share an address space
- So far we have assumed that processes/threads communicate via shared data (a counter, a producer-consumer buffer, ...)
- How do we synchronize and communicate among processes with different address spaces?
  - Inter-process communication (IPC)
- Can we have a single set of primitives that works for all cases: a single machine and OS, multiple machines running the same OS, multiple machines running multiple OSes, fully distributed systems?
With No Shared Address Space
- No need for explicit mutual exclusion primitives: processes cannot touch the same data directly
- Communicate by sending and receiving explicit messages: message passing
- Synchronization is implicit in message passing
  - No need for explicit mutual exclusion
  - Event ordering via the sending and receiving of messages
- More portable to different environments, though it lacks some of the convenience of a shared address space

Sending A Message
- Within a computer: P1's Send() and P2's Recv() go through the OS kernel
- Across a network: messages travel through each machine's OS and the network (see COS 461)
- P1 can send to P2, and P2 can send to P1
- Typically, communication is completed via a send and a matching receive
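As a minimal sketch of the picture above, two processes with separate address spaces can exchange a message through a kernel-managed channel. Here a Unix pipe stands in for any such channel; the pipe itself is an illustrative choice, not the only mechanism.

```python
import os

# Parent and child have separate address spaces after fork(); they
# communicate only through the kernel's pipe buffer. Synchronization is
# implicit: the parent's read() blocks until the child's write() arrives.
r, w = os.pipe()
pid = os.fork()
if pid == 0:                    # child: the "sender" P1
    os.close(r)
    os.write(w, b"hello from P1")
    os._exit(0)
os.close(w)                     # parent: the "receiver" P2
msg = os.read(r, 1024)          # blocks until the message is available
os.close(r)
os.waitpid(pid, 0)
```

Note that no lock is needed anywhere: the kernel serializes access to the channel on the processes' behalf.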
Simple Send and Receive
- send(dest, data), receive(src, data)
- "data" in send specifies where the data are in the sender's address space
- "data" in recv specifies where the incoming message data should be put in the receiver's address space

Simple API
- send(dest, data), receive(src, data)
- Destination or source
  - Direct address: node id, process id
  - Indirect address: mailbox, socket, channel, ...
- Data
  - Buffer (address) and size
  - Anything else that specifies the source data or the destination data structure
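A toy model of this API with direct addressing can be sketched in a few lines of Python. The per-destination queue table and the names `mailboxes`, `send`, and `recv` are illustrative assumptions, not part of any real OS interface.

```python
import queue

# One inbox per destination name: a toy model of direct addressing,
# where send() names the receiving process and recv() names the
# caller's own inbox.
mailboxes = {}

def send(dest, data):
    # Copy the data out of the sender's structure before enqueueing,
    # so the sender may reuse its buffer immediately after send returns.
    mailboxes.setdefault(dest, queue.Queue()).put(bytes(data))

def recv(inbox):
    # Blocks until a message is available in this process's inbox.
    return mailboxes.setdefault(inbox, queue.Queue()).get()

send("P2", b"hello")
msg = recv("P2")
```

Switching to indirect addressing would only change the first argument from a process id to a mailbox or channel name; the queue machinery stays the same.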
Simple Semantics
- Send does not return until the data have been copied out of the source data structure
  - Could be copied to the destination process/machine, or to an OS buffer, ... (there are variants depending on this)
  - The source data structure can then be safely overwritten without changing the data carried by the message
- Receive does not return until data have been received and copied into the destination data structure
- This is called synchronous message passing
  - It makes synchronization implicit and easy
  - But processes wait around a lot, so it can hurt performance

Issues/Options
- Asynchronous vs. synchronous
- Buffering of messages
- Matching of messages
- Direct vs. indirect communication/specification
- Data alone, or function invocation?
- How to handle exceptions (when bad things happen)?
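The copy-out guarantee of synchronous send can be shown in a short Python sketch; `sync_send` and the list-backed channel are assumed names for illustration only.

```python
# A synchronous send copies the message out of the source data
# structure before returning, so the sender can overwrite its own
# buffer without changing the data carried by the message.
def sync_send(channel, data):
    channel.append(bytes(data))   # a copy, not a reference to the sender's buffer

channel = []
buf = bytearray(b"hello")
sync_send(channel, buf)
buf[:] = b"XXXXX"                 # sender reuses its buffer after send returns
```

The message in the channel still reads `b"hello"`: the receiver is insulated from anything the sender does to `buf` afterwards.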
Synchronous vs. Asynchronous Send
- Synchronous: send( dest, msg )
  - Will not return until the data are copied out of the source data structure
  - If a buffer is used for messaging and it is full, block
- Asynchronous: status = async_send( dest, msg )
  - Returns before the data are copied out of the source data structure
  - Applications must check the completion status, e.g.:

        status = async_send( dest, msg );
        ...
        if !send_complete( status )
            wait for completion;
        ...
        use msg data structure;

  - Alternatively, the system notifies or signals the application on completion

Synchronous vs. Asynchronous Receive
- Synchronous: recv( src, msg )
  - Returns data if there is a message
  - Blocks on an empty buffer
- Asynchronous: status = async_recv( src, msg )
  - Returns data if there is a message; returns a status if there is none
  - Completion detected by polling (probe) or notification, e.g.:

        status = async_recv( src, msg );
        if ( status == SUCCESS )
            consume msg;
        while ( probe(src) != HaveMSG )
            wait for msg arrival
        recv( src, msg );
        consume msg;
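The asynchronous status-checking style above can be modeled in Python with a non-blocking queue. The names `async_send`, `async_recv`, and `probe`, and the status strings, are illustrative assumptions mirroring the pseudocode, not a real OS API.

```python
import queue

chan = queue.Queue(maxsize=1)   # a one-slot message buffer between endpoints

def async_send(chan, msg):
    # Returns immediately: SUCCESS if the buffer had room, else WOULD_BLOCK.
    try:
        chan.put_nowait(msg)
        return "SUCCESS"
    except queue.Full:
        return "WOULD_BLOCK"

def probe(chan):
    # Non-blocking check for a pending message.
    return "HaveMSG" if not chan.empty() else "NoMSG"

def async_recv(chan):
    # Returns the data if a message is present, or a status if not.
    try:
        return "SUCCESS", chan.get_nowait()
    except queue.Empty:
        return "EMPTY", None

status0, _ = async_recv(chan)    # nothing sent yet
send_status = async_send(chan, "m1")
seen = probe(chan)
status1, msg = async_recv(chan)
```

The application, not the messaging layer, is responsible for re-checking status or polling with `probe` until the message arrives, which is exactly the extra burden the slides attribute to the asynchronous style.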
Buffering
- No buffering
  - The sender must wait until the receiver receives the message
  - Rendezvous on each message
- Finite buffer
  - The sender blocks when the buffer is full
- On distributed machines/OSes, there may be buffers at one or both ends

Synchronous Send/Recv Within a System
- Synchronous send:
  - Call the send system call with M
  - Send system call: if there is no buffer space in the kernel, block; otherwise copy M to a kernel buffer
- Synchronous recv:
  - Call the recv system call
  - Recv system call: if there is no M in the kernel, block; otherwise copy it to the user buffer
- How to manage the kernel buffer?
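The finite-buffer case can be demonstrated with two threads and a bounded queue standing in for the kernel buffer; the two-slot size and the names here are assumptions chosen to make the blocking visible.

```python
import threading
import time
import queue

kbuf = queue.Queue(maxsize=2)   # stand-in for a finite kernel buffer
sent = []

def sender():
    for i in range(4):
        kbuf.put(i)             # blocks while the buffer is full
        sent.append(i)

t = threading.Thread(target=sender)
t.start()
time.sleep(0.2)
blocked_after = list(sent)      # sender is stuck in put(2): only 0 and 1 completed

got = [kbuf.get() for _ in range(4)]   # receiver drains; sender unblocks
t.join()
```

With no receiver running, the sender makes progress only up to the buffer capacity and then waits, which is the "sender blocks on buffer full" policy; a zero-capacity buffer would degenerate into a rendezvous on every message.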
What if Buffers Fill Up?
- Make processes wait (can be hard to do when they are on different machines)

Direct Addressing Example

    Producer(){
        ...
        while (1) {
            produce item;
            recv( Consumer, &credit );
            send( Consumer, item );
        }
    }

    Consumer(){
        ...
        for (i = 0; i < N; i++)
            send( Producer, credit );
        while (1) {
            recv( Producer, &item );
            send( Producer, credit );
            consume item;
        }
    }

Indirect Addressing Example

    Producer(){
        ...
        while (1) {
            produce item;
            recv( prodMbox, &credit );
            send( consMbox, item );
        }
    }

    Consumer(){
        ...
        for (i = 0; i < N; i++)
            send( prodMbox, credit );
        while (1) {
            recv( consMbox, &item );
            send( prodMbox, credit );
            consume item;
        }
    }

Indirect Communication
- Names
  - mailbox, socket, channel, ... (e.g. a pipe)
- Would it work with multiple producers and 1 consumer?
- Would it work with 1 producer and multiple consumers?
- What about multiple producers and multiple consumers?

Mailbox Message Passing
- Message-oriented 1-way communication
- Data structure
  - Mutex, condition variable, buffer for messages
- Operations
  - Init, open, close, send, receive, ...
- Does the sender know when the receiver gets a message?

Example: Keyboard Input
- Interrupt handler
  - Gets the input characters and gives them to the device thread
- Device thread
  - Generates a message and sends it to the mailbox of an input process
- Flow: interrupt handler → device thread → mbox_send(M) → mailbox → mbox_recv(M) inside getchar()

Sockets
- Bidirectional (unlike a mailbox)
  - Unix domain sockets (IPC)
  - Network sockets (over the network)
  - Same APIs for both
- Two types
  - Datagram socket (UDP): a collection of messages; best effort; connectionless
  - Stream socket (TCP): a stream of bytes (like a pipe); reliable; connection-oriented

Network Socket Address Binding
- A network socket binds to
  - Host: IP address (e.g. 128.112.9.1)
  - Protocol: UDP/TCP
  - Port
- Well-known ports (0..1023), e.g. port 80 for the Web
- Unused ports (1025..65535) are available for clients
- Why ports? Indirection: no need to know which process to communicate with, and updating software on one side won't affect the other side

Communication with Stream Sockets
- Client: create a socket; connect to the server; send a request; receive the response
- Server: create a socket; bind to a port; listen on the port; accept the connection request; receive the request; send the response

Sockets API
- Create and close a socket
  - sockid = socket(af, type, protocol);
  - sockerr = close(sockid);
- Bind a socket to a local address
  - sockerr = bind(sockid, localaddr, addrlength);
- Negotiate the connection
  - listen(sockid, length);
  - accept(sockid, addr, length);
- Connect a socket to a destination
  - connect(sockid, destaddr, addrlength);
- Message passing
  - send(sockid, buf, size, flags);
  - recv(sockid, buf, size, flags);

Unix Pipes
- An output stream connected to an input stream by a chunk of memory (a queue of bytes)
- Send (called write) does not block while there is buffer space (it blocks once the pipe fills)
- Receive (called read) is blocking
- Buffering is provided by the OS

What if things go bad?
- R waits for a message from S, but S has terminated
  - R may be blocked forever
- S sends a message to R, but R has terminated
  - S has no buffer and will be blocked forever
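The stream-socket exchange described earlier (socket, bind, listen, accept, connect, send, recv) can be sketched end-to-end in Python on the loopback interface. The server runs in a thread for compactness; port 0 asks the OS for an ephemeral port, and all names here are local to the example.

```python
import socket
import threading

def server(ready, port_holder):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create a socket
    srv.bind(("127.0.0.1", 0))        # bind to a local address (ephemeral port)
    srv.listen(1)                     # listen on the port
    port_holder.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()            # accept the connection request
    req = conn.recv(1024)             # receive the request
    conn.sendall(b"reply:" + req)     # send the response
    conn.close()
    srv.close()

ready, port = threading.Event(), []
t = threading.Thread(target=server, args=(ready, port))
t.start()
ready.wait()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # client: create a socket
cli.connect(("127.0.0.1", port[0]))   # connect to the server
cli.sendall(b"ping")                  # send the request
resp = cli.recv(1024)                 # receive the response
cli.close()
t.join()
```

The same code works over a real network by replacing the loopback address, which is the portability argument the slides make for the socket API.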
Exception: Message Loss
- Use ack and timeout to detect and retransmit a lost message
  - The receiver sends an ack for each message
  - The sender blocks until an ack message comes back, or a timeout:

        status = send( dest, msg, timeout );

  - If the timeout expires with no ack, retransmit the message
- Issues
  - Duplicates
  - Losing ack messages

Exception: Message Loss, contd.
- Retransmission must handle
  - Duplicate messages on the receiver side
  - Out-of-sequence ack messages on the sender side
- Retransmission
  - Use a sequence number for each message to identify duplicates
  - Remove duplicates on the receiver side
  - The sender retransmits on an out-of-sequence ack
- Reduce ack messages
  - Bundle ack messages
  - Piggy-back acks on send messages

Exception: Message Corruption
- Detection
  - Compute a checksum over the entire message and send the checksum (e.g. a CRC code) as part of the message
  - Recompute the checksum on receive and compare with the checksum in the message
- Correction
  - Trigger retransmission
  - Use error-correcting codes to recover

Message Passing Interface (MPI)
- A message-passing library for parallel machines
  - Implemented at user level for high-performance computing
  - Portable
- Basic (6 functions)
  - Works for most parallel programs
- Large (125 functions)
  - Blocking (synchronous) message passing
  - Non-blocking (asynchronous) message passing
  - Collective communication
- References
  - http://www.mpi-forum.org/

Remote Procedure Call (RPC)
- Make remote procedure calls
  - Similar to local procedure calls
  - Examples: SunRPC, Java RMI
- Restrictions
  - Call by value
  - Call by object reference (maintain consistency)
  - Not call by reference
- Different from mailbox, socket, or MPI
  - Remote execution, not just data transfer
- References
  - B. J. Nelson, Remote Procedure Call, PhD Dissertation, 1981
  - A. D. Birrell and B. J. Nelson, Implementing Remote Procedure Calls, ACM Trans. on Computer Systems, 1984

RPC Model
- Caller (client): makes the RPC call; the request message includes the arguments
- Server: executes the function with the passed arguments; the reply message includes a return value
- Return is the same as for local calls
- Compile-time type checking and interface generation

RPC Mechanism
- The client program calls the client stub, which encodes/marshals the call and arguments; the RPC runtime sends ClientId, RPCId, Call, Args
- The server's RPC runtime receives the message; the server stub decodes/unmarshalls it and calls into the server program
- The reply (Results) follows the reverse path, with the client stub decoding/unmarshalling the return value

Summary
- Message passing
  - Moves data between processes
  - Implicit synchronization
  - Many API design alternatives (sockets, MPI)
  - Indirection is helpful
- Implementation and semantics
  - The synchronous method is most common
  - The asynchronous method provides overlapping, but requires careful design and implementation decisions
  - Indirection makes the implementation flexible
  - Exceptions need to be handled carefully
- RPC
  - Remote execution that looks like a local procedure call
  - With constraints on how data are passed

Appendix: Message Passing Interface (MPI)

Hello World using MPI

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char *argv[] )
    {
        int rank, size;
        MPI_Init( &argc, &argv );                 /* initialize the MPI environment */
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );   /* return my rank */
        MPI_Comm_size( MPI_COMM_WORLD, &size );   /* return the number of processes */
        printf( "I am %d of %d\n", rank, size );
        MPI_Finalize();                           /* last call, to clean up */
        return 0;
    }

Blocking Send
- MPI_Send(buf, count, datatype, dest, tag, comm)
  - buf: address of the send buffer
  - count: number of elements in the buffer
  - datatype: data type of each send buffer element
  - dest: rank of the destination
  - tag: message tag
  - comm: communicator
- This routine may block until the message is received by the destination process
  - Depending on the implementation
  - But it will block until the user source buffer is reusable
- More about message tags later

Blocking Receive
- MPI_Recv(buf, count, datatype, source, tag, comm, status)
  - buf: address of the receive buffer (output)
  - count: maximum number of elements in the receive buffer
  - datatype: data type of each receive buffer element
  - source: rank of the source
  - tag: message tag
  - comm: communicator
  - status: status object (output)
- Receives a message with the specified tag from the specified communicator and specified source process
- MPI_Get_count(status, datatype, count) returns the real count of the received data

More on Send & Recv
- Message passing must match on
  - Source rank (can be MPI_ANY_SOURCE)
  - Tag (can be MPI_ANY_TAG)
  - Comm (can be MPI_COMM_WORLD)
- Example: MPI_Send(..., dest=1, tag=1, comm=X, ...) matches MPI_Recv(..., source=0, tag=1, comm=X, ...)

Buffered Send
- MPI_Bsend(buf, count, datatype, dest, tag, comm)
  - buf: address of the send buffer
  - count: number of elements in the buffer
  - datatype: type of each send element
  - dest: rank of the destination
  - tag: message tag
  - comm: communicator
- May buffer; the user can reuse the send buffer right away
  - The buffer is created by MPI_Buffer_attach(); MPI_Buffer_attach()/MPI_Buffer_detach() create and destroy it
- MPI_Ssend: returns only when the matching receive has been posted; no buffer needed
- MPI_Rsend: assumes the receive has been posted already (the programmer's responsibility)

Non-Blocking Send
- MPI_Isend(buf, count, datatype, dest, tag, comm, *request)
  - request is a handle, used by the other calls below
- Returns right away
  - Unsafe to use buf right away
- MPI_Wait(*request, *status)
  - Blocks until the send is done
- MPI_Test(*request, *flag, *status)
  - Returns the status without blocking, e.g.:

        MPI_Isend(...)
        while ( flag == FALSE ) {
            more work
            MPI_Test(..., flag, ...);
        }

Non-Blocking Receive
- MPI_Irecv(buf, count, datatype, source, tag, comm, *request)
- Returns as soon as possible
- MPI_Wait()
  - Blocks until the receive has finished
- MPI_Test()
  - Returns the status without blocking
- MPI_Probe(source, tag, comm, *flag, *status)
  - Is there a matching message? E.g.:

        while ( flag == FALSE ) {
            more work
            MPI_Probe(...)
        }
        MPI_Irecv(...) or MPI_Recv(...)
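The ack/timeout retransmission scheme from the message-loss slides earlier can be sketched in Python with a simulated lossy channel. `lossy`, `deliver`, and `send_reliable` are illustrative names, the channel is a coin flip rather than a real network, and the fixed seed just makes the run repeatable.

```python
import random

random.seed(42)  # fixed seed so the simulated channel behaves deterministically

def lossy(msg, loss_rate=0.5):
    # Hypothetical channel that drops a message with probability loss_rate.
    return None if random.random() < loss_rate else msg

received = {}                       # receiver state: sequence number -> payload

def deliver(msg):
    seq, data = msg
    if seq not in received:         # sequence numbers identify duplicates ...
        received[seq] = data        # ... so duplicates are dropped on the receiver side
    return ("ack", seq)             # every copy is acked, even a duplicate

def send_reliable(seq, data, max_tries=50):
    for _ in range(max_tries):
        got = lossy((seq, data))    # the message itself may be lost ...
        if got is None:
            continue                # timeout with no ack: retransmit
        ack = lossy(deliver(got))   # ... and so may the ack
        if ack == ("ack", seq):
            return "SUCCESS"
    return "TIMEOUT"

results = [send_reliable(i, p) for i, p in enumerate([b"a", b"b", b"c"])]
```

A lost ack makes the sender retransmit a message the receiver already has, which is exactly why the receiver must deduplicate by sequence number rather than trust each arrival to be new.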