Lecture 2
Physical Organization
- Uniform memory access (UMA) machines
(figure: processors P connected through an interconnect I to memory modules M)
All memory is equally far away from all processors.
Early parallel processors like the NYU Ultracomputer. Problem: why go across the network for instructions? For read-only data? What about caches?
- Non-uniform memory access (NUMA) machines: each processor P has its own local memory M. Access to local memory is usually 10-1000 times faster than access to non-local memory.
Static and dynamic locality of reference are critical for high performance. Compiler support? Architectural support?
Bus-based symmetric multiprocessors (SMPs): combine both aspects.
Logical Organization
- Shared Memory Model
(conceptual picture: processors P and memory modules M on an interconnect I, all sharing a single address space)
- hardware/systems software provide the single address space model to the applications programmer
- some systems: distinguish between local and remote references
- communication between processors: read/write shared memory locations (put, get)
- Distributed Memory Model (Message Passing)
(conceptual picture: each processor P paired with its own memory M, connected by an interconnect I)
- each processor has its own address space
- communication between processors: messages (like e-mail)
- basic message-passing commands: send, receive
Key difference: in the shared memory model, P1 can access remote memory locations without prearranged participation of the application program on the remote processor.
Message Passing
Blocking SEND/RECEIVE: couple data transfer and synchronization
- Sender and receiver rendezvous to exchange data:

    SrcP                          DestP
    ......                        ......
    x := ....                     RECEIVE(y, SrcP)
    SEND(x, DestP)                ......

History: Caltech Cosmic Cube
- The SrcP field in the RECEIVE command permits DestP to select which processor it wants to receive data from.
- Implementation:
  - SrcP sends a token saying `ready to send'
  - DestP returns a token saying `me too'
  - data transfer takes place directly between application programs w/o buffering in the O/S
- Motivation: hardware `channels' between processors in early multicomputers
- Problem:
  - sender cannot push data out and move on
  - receiver cannot do other work if data is not available yet
  One possibility: a new command TEST(SrcP, flag): is there a message from SrcP?
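As a concrete illustration, here is a minimal sketch of this blocking style in MPI (introduced later in this lecture): the lecture's SEND/RECEIVE correspond roughly to MPI_Send/MPI_Recv, and TEST to MPI_Iprobe. The values and the tag are made up for the example.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        double x, y;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                  /* SrcP */
            x = 3.14;
            /* blocks until x is safe to reuse (an implementation may
               rendezvous, as described above, or buffer internally) */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {           /* DestP */
            int flag;
            /* analogue of TEST(SrcP, flag): is there a message from SrcP? */
            MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
            /* blocking RECEIVE(y, SrcP): waits until the data arrives */
            MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("got %f\n", y);
        }
        MPI_Finalize();
        return 0;
    }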
Overlapping of computation and communication is critical for performance.
Non-blocking SEND/RECEIVE: decouple synchronization from data transfer
    SrcP                          DestP
    ......                        ......
    x := ....                     RECEIVE(y, SrcP, tag, flag)
    SEND(x, DestP, tag)           ......
         (network in between)

- SrcP can push data out and move on.
- Many variations: return to the application program when
  - data is out on the network?
  - data has been copied into an O/S buffer?
- The tag field on messages permits the receiver to receive messages in an order different from the order in which they were sent by SrcP.
- RECEIVE does not block: the flag is set to true by the O/S if data was transferred, false otherwise. The application program can test the flag and take the right action.
- What if DestP has not done a RECEIVE when data arrives from SrcP? Data is buffered in O/S buffers at DestP till the application program does a RECEIVE.
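For the variation that returns once the data has been copied into a buffer, MPI's buffered send is a close match. A minimal sketch, assuming a single double is sent; the explicitly attached buffer stands in for the O/S buffer in the discussion above:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        int rank, bufsize;
        double x = 1.0, y;
        char *buf;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* attach buffer space for one double plus bookkeeping overhead */
        MPI_Pack_size(1, MPI_DOUBLE, MPI_COMM_WORLD, &bufsize);
        bufsize += MPI_BSEND_OVERHEAD;
        buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);
        if (rank == 0) {              /* SrcP pushes data out and moves on */
            MPI_Bsend(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            x = 42.0;                 /* safe: message was copied to buf */
        } else if (rank == 1) {
            MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
        MPI_Finalize();
        return 0;
    }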
Can we eliminate waiting at SrcP? Can we eliminate buffering of data at DestP?
Asynchronous SEND/RECEIVE

    SrcP                          DestP
    ......                        ......
    x := ....                     IRECEIVE(y, SrcP, tag, flag2)
    ISEND(x, DestP, tag, flag1)   ......
         (network in between)
- ISEND returns as soon as the O/S knows about what needs to be sent.
- `flag1' is set by the O/S when the data in x has been shipped out. The application program continues, but must test `flag1' before overwriting x.
- IRECEIVE is non-blocking:
  - returns before data arrives
  - tells the O/S to place data in `y' and set `flag2' after data is received (`posting' of information to the O/S)
- `flag2' is written by the O/S and read by the application program on DestP.
- Eliminates buffering of data in the DestP O/S area if the IRECEIVE is posted before the message arrives at DestP.
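MPI's non-blocking operations implement exactly this style: MPI_Isend/MPI_Irecv play the roles of ISEND/IRECEIVE, with an MPI_Request handle standing in for flag1/flag2. A sketch; the comments mark where overlapped computation would go:

    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, done = 0;
        double x, y;
        MPI_Request req;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                  /* SrcP */
            x = 2.71;
            /* ISEND: returns immediately; req plays the role of flag1 */
            MPI_Isend(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* ... overlap computation here, but do not overwrite x ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* now x may be reused */
        } else if (rank == 1) {           /* DestP */
            /* IRECEIVE posted early: the data can be placed directly
               into y, with no buffering in the O/S area */
            MPI_Irecv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
            while (!done) {
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* tests flag2 */
                /* ... do other useful work while waiting ... */
            }
        }
        MPI_Finalize();
        return 0;
    }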
So far, we have looked at point-to-point communication.
Collective communication: patterns of group communication that can be implemented more efficiently than through long sequences of sends and receives.
- important ones:
  - one-to-all broadcast (e.g. A*x implemented by rowwise distribution: all processors need x)
  - all-to-one reduction (e.g. adding a set of numbers distributed across all processors)
  - all-to-all broadcast: every processor sends a piece of data to every other processor
  - one-to-all personalized communication: one processor sends a different piece of data to all other processors
  - all-to-all personalized communication: each processor does a one-to-all personalized communication
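These five patterns map directly onto MPI's collective operations; the sketch below lists the correspondence (buffer sizes are illustrative: `all' and `chunks' must hold one element per process).

    #include <mpi.h>

    /* sketch: the five patterns as MPI collectives */
    void patterns(double *x, int n, double *partial, double *sum,
                  double *mine, double *all, double *chunks) {
        /* one-to-all broadcast: all processors get x from process 0 */
        MPI_Bcast(x, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        /* all-to-one reduction: sum of the partial values lands on 0 */
        MPI_Reduce(partial, sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        /* all-to-all broadcast: everyone ends up with everyone's piece */
        MPI_Allgather(mine, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE,
                      MPI_COMM_WORLD);
        /* one-to-all personalized: 0 deals a different piece to each */
        MPI_Scatter(chunks, 1, MPI_DOUBLE, mine, 1, MPI_DOUBLE, 0,
                    MPI_COMM_WORLD);
        /* all-to-all personalized: every process scatters to every other */
        MPI_Alltoall(chunks, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE,
                     MPI_COMM_WORLD);
    }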
Example: One-to-all broadcast (intuition: think `tree')

(figure: one-to-all broadcast among 8 nodes; the data reaches 2, then 4, then 8 nodes in phases 1, 2, 3; messages in each phase do not compete for links)
Assuming message size is small, time to send a message = Ts + h*Th, where
  Ts = overhead at sender/receiver
  Th = time per hop
  h  = number of hops

Total time for broadcast = (Ts + Th*P/2) + (Ts + Th*P/4) + ..... = Ts*logP + Th*(P-1)
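Written out as the geometric series over the log P phases (assuming P is a power of two, so the message travels P/2^i hops in phase i):

    \[
      T_{\text{broadcast}} \;=\; \sum_{i=1}^{\log_2 P}\left(T_s + T_h\,\frac{P}{2^i}\right)
      \;=\; T_s\,\log_2 P \;+\; T_h\,(P-1)
    \]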
Reality check: actually, a k-ary tree makes sense, because processor 0 can send many messages by the time processor 4 is ready to participate in the broadcast.
Other topologies: use the same idea
(figure: one-to-all broadcast on a 2-D Mesh, phases 1 and 2)
Step 1: Broadcast within row of originating processor
Step 2: Broadcast within each column in parallel
Time = Ts*logP + 2*Th*(sqrt(P) - 1)
Example: All-to-one reduction
(figure: all-to-one reduction among 8 nodes; partial results are combined toward node 0 in phases 1, 2, 3; messages in each phase do not compete for links)
Purpose: apply a commutative and associative operator (reduction operator) like +, *, AND, OR, etc. to values contained in each node.
Can be viewed as the inverse of one-to-all broadcast.
Same time as one-to-all broadcast.
Important use: determine when all processors are finished working (implementation of `barrier')
Example: All-to-all broadcast
(figure: all-to-all broadcast on an 8-node ring; in each phase every node forwards the most recently received value to its neighbor)
- Intuition: cyclic shift register.
- Each processor receives a value from one neighbor, stores it away, and sends it to the next neighbor in the next phase.
- Total of (P-1) phases to complete the all-to-all broadcast.

Time = (Ts + Th)*(P-1), assuming message size is small.

- Same idea can be applied to meshes as well:
  - first phase: all-to-all broadcast within each row
  - second phase: all-to-all broadcast within each column
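A sketch of the ring algorithm in MPI: MPI_Sendrecv combines each phase's send and receive so the exchanges cannot deadlock. The fixed-size array and the sample values are assumptions of this sketch.

    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int p, rank, i;
        double all[64];               /* assumes p <= 64 for this sketch */
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int next = (rank + 1) % p, prev = (rank + p - 1) % p;
        all[rank] = 100.0 + rank;     /* this processor's own piece */
        /* P-1 phases: pass along the value received in the previous phase */
        for (i = 0; i < p - 1; i++) {
            int send_idx = (rank + p - i) % p;      /* piece to forward */
            int recv_idx = (rank + p - i - 1) % p;  /* piece that arrives */
            MPI_Sendrecv(&all[send_idx], 1, MPI_DOUBLE, next, 0,
                         &all[recv_idx], 1, MPI_DOUBLE, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* every processor now holds all p values in all[0..p-1] */
        MPI_Finalize();
        return 0;
    }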
A Message-passing Program
MPI: Message-Passing Interface
Goal: Portable Parallel Programming for Distributed Memory Computers
- Lots of vendors of distributed memory computers: IBM, NCube, Intel, CM-5, .....
- Each vendor had its own communication constructs, so porting required changing parallel programs even to go from one distributed memory platform to another!
- MPI goal: standardize the syntax and semantics of message-passing constructs.
- Mid-1994: MPI-1 standard out and several implementations available (SP-2).
Write an MPI program to perform matrix-vector multiply
(figure: matrix A and vector b)
- Style of programming: master-slave
  - one master, several slaves
  - master coordinates activities of the slaves
- Master initially owns all rows of A and vector b
- Master broadcasts vector b to all slaves
- Slaves are self-scheduled:
  - each slave comes to the master for work
  - master sends a row of the matrix to the slave
  - slave performs the product, returns the result, and asks for more work
- Very naive algorithm, but it's a start.
Key MPI routines we will use:

MPI_INIT: initialize the MPI system
MPI_COMM_SIZE: find out how many processes there are
MPI_COMM_RANK: who am I?
MPI_SEND: send a message

    MPI_SEND(address, count, datatype, DestP, tag, comm)

  (address, count, datatype) permits entire data structures to be sent with one command; comm identifies the process group

MPI_RECV: receive a message (blocking receive)
MPI_FINALIZE: terminate MPI
MPI_BCAST: broadcast
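A sketch of the master-slave matrix-vector program built from these routines, using the C bindings. The matrix contents, the tag convention (tag = row index + 1, with tag 0 meaning `no more work'), and the fixed size N are choices made for this sketch, not prescribed by MPI.

    #include <mpi.h>
    #include <stdio.h>

    #define N 8             /* A is N x N; small, for the sketch */

    int main(int argc, char *argv[]) {
        int p, rank, i, j;  /* assumes p >= 2: one master + slaves */
        double A[N][N], b[N], y[N];
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                              /* master */
            int sent = 0, done = 0;
            for (i = 0; i < N; i++) {                 /* make up A and b */
                b[i] = 1.0;
                for (j = 0; j < N; j++) A[i][j] = i + j;
            }
            MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            /* prime every slave: a row each, or `stop' if none left */
            for (i = 1; i < p; i++)
                if (sent < N) {
                    MPI_Send(A[sent], N, MPI_DOUBLE, i, sent + 1,
                             MPI_COMM_WORLD);
                    sent++;
                } else
                    MPI_Send(NULL, 0, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
            while (done < N) {
                double dot;
                /* self-scheduling: take the next finished slave's result */
                MPI_Recv(&dot, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                y[status.MPI_TAG - 1] = dot;          /* tag encodes the row */
                done++;
                if (sent < N) {                       /* more work for it */
                    MPI_Send(A[sent], N, MPI_DOUBLE, status.MPI_SOURCE,
                             sent + 1, MPI_COMM_WORLD);
                    sent++;
                } else                                /* tag 0: no more work */
                    MPI_Send(NULL, 0, MPI_DOUBLE, status.MPI_SOURCE, 0,
                             MPI_COMM_WORLD);
            }
            for (i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
        } else {                                      /* slave */
            double row[N], dot;
            MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            for (;;) {
                MPI_Recv(row, N, MPI_DOUBLE, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                if (status.MPI_TAG == 0) break;       /* nothing left to do */
                dot = 0.0;
                for (j = 0; j < N; j++) dot += row[j] * b[j];
                /* return result; echo tag so master knows which row */
                MPI_Send(&dot, 1, MPI_DOUBLE, 0, status.MPI_TAG,
                         MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }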