PCOPP-2002, June 03-06, 2002, C-DAC, Pune

Shared Memory Programming: An Introduction to Pthreads

Day 2, Classroom Lecture 2

Copyright PCOPP, C-DAC, 2002

Lecture Outline

The following topics will be discussed:
• Shared memory programming - what is the model?
• Pthreads
• Designing threaded programs
• Examples of threaded programs
• Understanding Pthreads implementation
• Synchronization tools - Pthread functions for synchronization
• Pthread debugging tools
• Performance issues using threads and processes - Pthread performance issues

Shared Memory Programming

Why has parallel programming based on the shared-memory model not progressed as much as the message-passing model?
• Lack of a widely accepted standard such as MPI or PVM.
• Shared memory programs are written in a platform-specific language for multiprocessors (mostly SMPs); such programs are not portable across MPPs/PVPs/clusters.
• Platform-independent shared memory programming models: X3H5, Pthreads, and OpenMP.
• The X3H5 standard has not gained wide acceptance, but it has influenced the design of several commercial shared memory languages. The SGI Power C compiler uses a small set of structured constructs to extend C to a shared memory parallel language.

Explicit Parallelism: Shared Variable Model

• It has a single address space; data resides in a single shared address space and thus does not have to be explicitly allocated.
• It is multithreading and asynchronous (similar to the message-passing model).
• Workload can be allocated either explicitly or implicitly.
• Communication is done implicitly through shared reads and writes of variables; however, synchronization is explicit.
• The shared-variable model is not easier than the message-passing interaction model. Shared-variable programs can incur higher overhead and run more slowly than message-passing ones on an MPP, a cluster, or even an SMP.

Why Use Threads Over Processes?

• Creating a new process can be expensive:
  - More resources are required.
  - It takes time (the entire process must be replicated).
  - If process creation triggers process-rescheduling activity, the operating system's context-switching mechanism becomes involved.
• The cost of inter-process communication and synchronization of shared data, which may also involve calls into the operating system kernel, is quite complicated compared to MPI etc.

Why Use Threads Over Processes? (contd.)

• When processes synchronize, they usually have to issue system calls, a relatively expensive operation that involves trapping into the kernel.
• Threads can be created without replicating an entire process.
• Some (not all) of the work of creating a thread can be done in user space rather than kernel space.
• Threads can synchronize by simply monitoring a variable, staying within the user address space of the program.

Performance Issues: Threads versus Processes

Threads and processes are alike in many respects.
• Creation: Processes are more expensive to create, and once created, they use more resources than threads to intercommunicate.
• Overhead: Using processes results in more overhead than using threads.
• Synchronization: The synchronization mechanisms used by the multi-threaded server are more efficient than those used by the multiprocess server. Where the multi-threaded server uses mutex locks to control access to shared data, the multiprocess server uses System V semaphores.

Performance Issues: Threads versus Processes (contd.)

• Contention: When there is little contention among threads for the account data, the multithreaded server operates more efficiently, because Pthreads mutex-locking calls operate within user space, whereas the multiprocess server's locking calls are system calls that involve the operating system's kernel.
• Cost: The multi-threaded server outperforms the multiprocess server regardless of the number of clients. The difference between the multithreaded and multiprocess servers lies in the relative costs of creating threads versus creating processes.
• Sharing data: Whereas threads exchange data by simply placing it in global variables in their process's address space, processes must use pipes or special shared memory segments controlled by the operating system.

Shared Memory Programming: Threads

• Threads are usually the preferred way to parallelize codes on an SMP.
• All threads share a common address space, so communication and synchronization are much faster than with either explicit or implicit distributed shared memory (DSM).
• Because all threads are part of the same process, co-ordinating access to resources is very easy and is usually automatic.
• Example: All threads use the same file table, so sharing a file and keeping I/O coherent with respect to file position and synchronization is automatic.

Shared Memory Programming: Threads (contd.)

• Threads may communicate to share work, to synchronize, or to do other tasks.
• Because all of the threads are in the same process, and therefore have a common address space, it is fast and easy for them to communicate with each other through global variables.
• The simplest mechanism for synchronization is called a critical region: a block of code that at most one thread can execute at a time.
• Critical regions can be created with mutual exclusion (mutex) locks. There are four interesting mutex system calls.

Shared Memory Programming: Threads (contd.)

• An example of a critical region is in a module where threads try to retrieve messages.
• In general you do not want several threads trying to retrieve a given message at the same time, so you would put a critical region around the code that retrieves a message (see the sketch below):
  - Enter critical region
  - Get message
  - Update pointers into message queue
  - Leave critical region (allow other threads to enter)
• Remark: If two threads simultaneously try to retrieve a message, one will be allowed to retrieve it and the other will block at the point where the critical region is entered.
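A minimal sketch of this pattern in C, guarding a hypothetical linked-list message queue with a Pthreads mutex (the msg_t structure and field names are illustrative, not from the lecture):

    #include <pthread.h>
    #include <stddef.h>

    typedef struct msg { struct msg *next; /* payload omitted */ } msg_t;

    static msg_t *queue_head = NULL;   /* shared message queue */
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Retrieve one message; returns NULL if the queue is empty. */
    msg_t *get_message(void)
    {
        pthread_mutex_lock(&queue_lock);      /* enter critical region */
        msg_t *m = queue_head;                /* get message */
        if (m != NULL)
            queue_head = m->next;             /* update pointers into the queue */
        pthread_mutex_unlock(&queue_lock);    /* leave critical region */
        return m;
    }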

Shared Memory Programming: Threads

Benefits:
• The major benefit of multi-threaded programs over non-threaded ones is their ability to execute tasks concurrently.
• In providing concurrency, multithreaded programs introduce a certain amount of overhead.
• If you introduce threads in an application that cannot use concurrency, you add overhead without any performance benefit.

Threads Parallel Programming

• A process on a multithreaded system is more complex than a process on other systems.
• Processes on other systems typically own everything associated with their execution:
  - address space
  - file descriptors
  - working directory
  - priority
  - registers and everything else
• A multithreaded process may have many threads of execution running concurrently.
• It is difficult to have only one copy of certain resources.

Threads Parallel Programming

• Users only need to be concerned with the structure of a process if they are using threaded parallelism.
• A process is divided into three parts:
  - The highest level, called the process, contains global information that is unique within the process or must be known by all members of the process.
  - A process may consist of one or many lightweight processes (LWPs); each LWP may host one or many threads.
  - Each part of the process includes and manages part of the information and resources of which the whole process is composed.

Parallel Programming with Threads

• Resources with process-level scope include:
  - Address space
  - File descriptor space
  - Working directory
  - Any resource that is necessarily process-wide in scope
• Resources with LWP scope include:
  - Kernel priority
  - Signal mask for the currently executing thread
  - Kernel stack
  - CPU state
• Resources with thread scope include:
  - Thread priority
  - Signal mask
  - Registers, including program counter and stack pointer
  - CPU state

Parallel Programming with Threads (contd.)

• A thread is a user-level concept that is invisible to the kernel.
• Because threads are user-level objects, thread operations are fast: switching from one thread to another does not incur a kernel context switch.
• By default:
  - Threads are not visible to the kernel.
  - Threads are not scheduled onto CPUs.
  - Threads can exhibit unpredictable blocking behavior.
  - Threads cannot compete system-wide for resources.
  - Threads are mapped many-to-many onto lightweight processes (LWPs).
  - It is possible to create bound threads, in which case the mapping is one-to-one.

Parallel Programming with Threads (contd.)

LWPs:
• LWPs are visible to the kernel.
• In order to run, a thread must be assigned to an LWP, and the kernel must then schedule the LWP for system resources such as CPU time and memory.
• LWP operations often take longer than thread operations because they may incur a context switch.
• In general, LWPs are mapped to CPUs with a many-to-many mapping.
• A process may have a mix of one-to-one and many-to-one mappings between threads and LWPs.

Parallel Programming with Threads (contd.)

[Figure: Relations between threads and LWPs. Several user-level threads are multiplexed onto a smaller set of LWPs, which the kernel schedules.]

Common Performance Problems with Shared Memory

• Cost of communication in shared address space machines:
  - Costs are associated with read and write operations, which may be to local or non-local data.
  - Large numbers of remote memory accesses
  - False sharing (see the sketch below)
  - False data mapping
• Frequent synchronization:
  - Implicit synchronization of parallel constructs
  - Barriers, locks, …
• Load balancing:
  - Uneven work in parallel sections
  - Uneven scheduling of parallel loops
• Excessive communication
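To illustrate false sharing (my example, not from the slides): two threads each update their own counter, but if the counters sit on the same cache line, the hardware bounces that line between CPUs on every write. Padding each counter to its own cache line, here assumed to be 64 bytes, avoids the problem:

    #include <pthread.h>

    #define CACHE_LINE 64    /* assumed cache-line size in bytes */

    /* Padding keeps each thread's counter on its own cache line, so
     * updates by one thread do not invalidate the other thread's line. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < 100000000L; i++)
            counters[id].value++;   /* private cache line: no false sharing */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        return 0;
    }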

Designing Threaded Programs: Boss/Worker Model

Identify a task that is suitable for threading by applying the following criteria to it:
• It is independent of other tasks.
• It can become blocked in potentially long waits.
• It can use a lot of CPU cycles.
• It must respond to asynchronous events.
• Its work has greater or lesser importance than other work in the application.

Remark: Programs such as those written for database managers, file servers, or print servers are ideal applications for threading.

Designing Threaded Programs: Boss/Worker Model

• One thread, the boss, is in charge of work assignments for the other threads, the workers.
• The boss accepts input for the entire program; based on that input, it passes off specific tasks to one or more worker threads (see the skeleton below).
• The boss creates each worker thread, assigns it tasks and, if necessary, waits for it to finish.

Remarks:
• The boss/worker model works well with servers (database servers, file servers, window managers).
• The complexities of dealing with asynchronously arriving requests and communications are encapsulated in the boss.
• It is important that you minimize the frequency with which the boss and workers communicate.
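A minimal boss/worker skeleton in C, with the boss handing each incoming request to a freshly created worker (request_t and its id field are placeholders standing in for real input):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int id; /* request data would go here */ } request_t;

    static void *worker(void *arg)
    {
        request_t *req = arg;
        printf("worker handling request %d\n", req->id);
        free(req);
        return NULL;
    }

    int main(void)
    {
        enum { NREQUESTS = 4 };
        pthread_t workers[NREQUESTS];

        /* The boss accepts input and passes each task to a worker thread. */
        for (int i = 0; i < NREQUESTS; i++) {
            request_t *req = malloc(sizeof *req);
            req->id = i;                  /* stand-in for real input */
            pthread_create(&workers[i], NULL, worker, req);
        }

        /* If necessary, the boss waits for the workers to finish. */
        for (int i = 0; i < NREQUESTS; i++)
            pthread_join(workers[i], NULL);
        return 0;
    }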

Designing Threaded Programs: Peer Model

• The peer model makes each thread responsible for its own input. A peer knows its own input ahead of time, has its own private way of obtaining its input, or shares a single point of input with the other peers.
• It is also known as the work crew model. One thread must create all the other peer threads when the program starts; this thread subsequently acts as just another peer that processes requests, or suspends itself waiting for the other peers to finish.
• All threads work concurrently on their tasks without a specific leader.

Remark: The peer model is suitable for applications that have a fixed, well-defined set of inputs, such as matrix multipliers, parallel database search engines, and prime number generators.

Designing Threaded Programs: Pipeline Model

The model assumes:
• A long stream of input
• A series of sub-operations (known as stages or filters) through which every unit of input must be processed
• Each processing stage can handle a different unit of input at a time.

Remark:
• An automotive assembly line is a classic example of a pipeline.
• A RISC (reduced instruction set computing) processor also fits the pipeline model. The input to this pipeline is a stream of instructions; each instruction must pass through the stages of fetching, decoding, fetching operands, computation, and storing results. That many instructions can be at various stages of processing at the same time contributes to the exceptionally high performance of RISC processors.

Designing Threaded Programs: Buffering Data

• Threads transfer data to each other using buffers:
  - In the boss/worker model, the boss must transfer requests to the workers.
  - In the pipeline model, each thread must pass input to the thread that performs the next stage of processing.
  - In the peer model, peers may often exchange data.
• Common problems: Bugs easily creep into nearly every threaded application. They result from oversights in the way the application manages its shared resources; managing shared resources is very difficult.
• The basic rule for managing shared resources is simple and twofold:
  - Obtain a lock before accessing the resource.
  - Release the lock when you are finished with the resource.

Designing Threaded Programs: Buffering Data

• A thread assumes either of two roles as it exchanges data in a buffer with another thread. The thread that passes the data to another is known as the producer; the one that receives the data is known as the consumer.
• The ideal producer/consumer relationship requires (see the sketch below):
  - A buffer
  - A lock
  - A suspend/resume mechanism
  - State information

[Figure: Producer -> Buffer (protected by a lock) -> Consumer]
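A compact producer/consumer sketch in C using the four ingredients above: a one-slot buffer, a mutex as the lock, a condition variable as the suspend/resume mechanism, and a full flag as the state information (a minimal illustration, not the lecture's code):

    #include <pthread.h>
    #include <stdio.h>

    static int buffer;                /* one-slot buffer */
    static int full = 0;              /* state information */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;  /* suspend/resume */

    static void *producer(void *arg)
    {
        for (int i = 0; i < 5; i++) {
            pthread_mutex_lock(&lock);
            while (full)                    /* wait until the consumer empties it */
                pthread_cond_wait(&cond, &lock);
            buffer = i;
            full = 1;
            pthread_cond_signal(&cond);     /* wake the consumer */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        for (int i = 0; i < 5; i++) {
            pthread_mutex_lock(&lock);
            while (!full)                   /* wait until the producer fills it */
                pthread_cond_wait(&cond, &lock);
            printf("consumed %d\n", buffer);
            full = 0;
            pthread_cond_signal(&cond);     /* wake the producer */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }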

Example for Threaded Program: Matrix Multiplication

A matrix multiplication program (peer model):
• Assume that the program does not involve I/O operations.
• Create a peer thread for each individual element in the result array of the matrix c.
• It does not require much unusual synchronization; the main thread must wait for the peers to complete.
• No data synchronization is required, because the peers never write to any shared locations.
• The computation of each element in the result array is completely independent of the results for any other element:

    c[row,col] = a[row,1]*b[1,col] + a[row,2]*b[2,col] + ... + a[row,n]*b[n,col]
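A sketch of this peer-model multiply in C for small fixed-size matrices, with one peer thread per element of c (thread-per-element is rarely efficient in practice; it is shown only to mirror the slide's design):

    #include <pthread.h>

    #define N 3

    static double a[N][N], b[N][N], c[N][N];

    /* Each peer computes one element c[row][col]; no locking is needed
     * because no two peers ever write to the same location. */
    static void *mult(void *arg)
    {
        int idx = (int)(long)arg, row = idx / N, col = idx % N;
        double sum = 0.0;
        for (int k = 0; k < N; k++)
            sum += a[row][k] * b[k][col];
        c[row][col] = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t peers[N * N];
        /* ... initialize a and b here ... */
        for (long i = 0; i < N * N; i++)
            pthread_create(&peers[i], NULL, mult, (void *)i);
        for (int i = 0; i < N * N; i++)   /* the main thread waits for the peers */
            pthread_join(peers[i], NULL);
        return 0;
    }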

Example for Threaded Program: Matrix Multiplication

[Figure: The static input arrays a and b feed the program; "Mult" peer threads each compute one element of the output array c.]

Example for Threaded Program: Matrix Multiplication

[Figure: The same program with the Mult peers scheduled across CPU 0, CPU 1, and CPU 2, all writing into the output array c.]

Why Pthreads? : Thread Model

• The thread model takes a process and divides it into two parts:
  - One contains the resources used across the whole program (the process-wide information), such as program instructions and global data. This part is referred to as the process.
  - The other contains information related to the execution state, such as a program counter and a stack. This part is referred to as a thread.
• Pthreads is a standardized model for dividing a program into subtasks whose execution can be interleaved or run in parallel.
• The "P" in Pthreads comes from POSIX (Portable Operating System Interface), the family of IEEE operating system interface standards in which Pthreads is defined. Other thread models are Mach Threads and NT Threads.

Why Pthreads? : Thread Model - Unix Concurrent Programming: Multiple Processes

• Potential parallelism: the property of a program that its statements can be executed in any order without changing the result is called potential parallelism.
• Reasons for investigating a program's potential parallelism: overlapping I/O, asynchronous events, real-time scheduling.
• Unix allows user programs to create multiple processes and provides services the processes can use to communicate with each other.
• Creating a new process: the call that creates a new process is fork. It creates a child process that is identical to its parent process at the time the parent called fork, with the following properties (see the example below):
  - The child has its own process identifier, or PID.
  - The fork call provides different return values to the parent and the child processes.
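A small example of the fork properties above (standard POSIX usage, my illustration): the parent and child are distinguished by fork's return value, and each reports its own PID:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();            /* clone the calling process */

        if (pid < 0) {
            perror("fork");            /* creation failed */
            return 1;
        } else if (pid == 0) {
            /* fork returns 0 in the child, which has its own PID */
            printf("child:  pid=%d, parent=%d\n", (int)getpid(), (int)getppid());
        } else {
            /* fork returns the child's PID in the parent */
            printf("parent: pid=%d, child=%d\n", (int)getpid(), (int)pid);
            wait(NULL);                /* reap the child */
        }
        return 0;
    }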

Why Pthreads? : Thread Model - Pthreads Concurrent Programming: Multiple Threads

• Concurrent versus parallel programming:
  - Concurrent programming: the tasks we define can occur in any order. One task can occur before or after another, and some or all tasks can be performed at the same time.
  - Parallel programming: the simultaneous execution of concurrent tasks on different processors.
  - Remark: Thus, all parallel programming is concurrent, but not all concurrent programming is parallel.
• Creating a new thread: pthread_create. Threads are peers.
• Implementation-specific issues of Pthreads: sharing process resources; communication, synchronization, and scheduling.

Understanding Pthreads Implementation

Pthreads implementations fall into three basic categories:
• Based on pure user-space (library) threads
• Based on pure kernel threads
• Implementations somewhere between the two. These hybrid implementations are referred to variously as two-level schedulers, lightweight processes (LWPs), or activations.

The Solaris Pthreads implementation maps user threads to LWPs.

Pthread Functions for Synchronization

• pthread_join: allows one thread to suspend execution until another has terminated.
• Mutex variable functions: a mutex variable acts as a mutually exclusive lock, allowing threads to control access to data. The threads agree that only one thread at a time can hold the lock and access the data.
• Condition variable functions: a condition variable provides a way of naming an event in which threads have a general interest.
• pthread_once: a specialized synchronization tool that ensures that initialization routines get executed once and only once when called by multiple threads (see the sketch below).
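A short illustration of pthread_once (standard Pthreads usage; init_library is a placeholder): no matter how many threads reach the call, the initializer runs exactly once:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_once_t once = PTHREAD_ONCE_INIT;

    static void init_library(void)       /* runs exactly once */
    {
        printf("initialized\n");
    }

    static void *worker(void *arg)
    {
        /* Every thread calls this, but only the first triggers init_library. */
        pthread_once(&once, init_library);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }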

Pthreads: Synchronizing Tools

• Mutex - the common synchronization mechanism: to protect a shared resource from a race condition, we use a type of synchronization called mutual exclusion, or mutex for short.
• Mutex locks and unlocks work properly regardless of the platform you are using and the number of CPUs in the system.
• Critical section: the code paths or routines that access shared data and therefore require protection. How large does a critical section have to be to require protection?
• Access shared data through Pthreads library operations such as:
  - Mutex functions
  - Condition variable functions
  - Semaphores
  - Read/write locks
  - Thread-safe data structures

Pthreads: Debugging

• It is easy to write multi-threaded programs with bugs. Investigate the types of programming errors that result from thread synchronization problems, namely data corruption or program hangs, and race conditions.
  - A race condition occurs when multiple threads share data and at least one of the threads accesses the data without going through a defined synchronization mechanism (see the example below).
  - A common reason for a race, forgetting to unlock a mutex, is the easiest to solve.
• Event ordering: the ordering of the events performed collectively by a program's threads at run time becomes supremely important in debugging a multi-threaded program.
• Duplicating data corruption or program hangs requires aligning events among threads that run concurrently.
• Being a new technology, vendors have yet to upgrade their debuggers to operate well on threaded programs.
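A classic race-condition demonstration (my illustration): two threads increment a shared counter without a mutex, so the final count is usually less than 2000000 because the read-modify-write sequences interleave. Uncommenting the lock calls makes the result deterministic:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;          /* shared, unsynchronized on purpose */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *inc(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            /* pthread_mutex_lock(&lock);    uncomment to fix the race */
            counter++;                /* read-modify-write: not atomic */
            /* pthread_mutex_unlock(&lock); */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, inc, NULL);
        pthread_create(&t2, NULL, inc, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }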

Thread Interaction Primitives in Pthreads

Function                        Meaning
pthread_mutex_init(…)           Creates a new mutex variable
pthread_mutex_destroy(…)        Destroys a mutex variable
pthread_mutex_lock(…)           Locks (acquires) a mutex variable
pthread_mutex_trylock(…)        Tries to acquire a mutex variable without blocking
pthread_mutex_unlock(…)         Unlocks (releases) a mutex variable
pthread_cond_init(…)            Creates a new condition variable
pthread_cond_destroy(…)         Destroys a condition variable
pthread_cond_wait(…)            Waits (blocks) on a condition variable
pthread_cond_timedwait(…)       Waits on a condition variable up to a time limit
pthread_cond_signal(…)          Posts an event, unblocking one waiting thread
pthread_cond_broadcast(…)       Posts an event, unblocking all waiting threads
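The two bounded-waiting variants in the table deserve a short illustration (my sketch): pthread_mutex_trylock returns EBUSY instead of blocking, and pthread_cond_timedwait gives up at an absolute deadline, returning ETIMEDOUT:

    #include <pthread.h>
    #include <errno.h>
    #include <stdio.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static int ready = 0;             /* predicate guarded by the lock */

    int main(void)
    {
        /* trylock: returns 0 on success, EBUSY if another thread holds it. */
        if (pthread_mutex_trylock(&lock) == 0) {
            /* timedwait: block at most 2 seconds past the current time. */
            struct timespec deadline;
            clock_gettime(CLOCK_REALTIME, &deadline);
            deadline.tv_sec += 2;

            while (!ready) {
                int rc = pthread_cond_timedwait(&cond, &lock, &deadline);
                if (rc == ETIMEDOUT) {
                    printf("gave up waiting\n");
                    break;
                }
            }
            pthread_mutex_unlock(&lock);
        } else {
            printf("lock was busy\n");
        }
        return 0;
    }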

The POSIX Threads (Pthreads) Model

Its functionality and interface are similar to those of Solaris threads.

Function Prototype                                                     Meaning
int pthread_create(pthread_t* thread_id, pthread_attr_t* attr,
                   void* (*myroutine)(void*), void* arg)               Create a thread
int pthread_join(pthread_t thread, void** status)                      Join a thread
void pthread_exit(void* status)                                       A thread exits
pthread_t pthread_self(void)                                          Returns the calling thread's ID
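Putting the four prototypes together in a minimal, complete program (a standard usage pattern, not from the slides):

    #include <pthread.h>
    #include <stdio.h>

    static void *myroutine(void *arg)
    {
        /* pthread_t is printed via a cast for illustration only;
         * the cast is not portable in general. */
        printf("thread %lu received %d\n",
               (unsigned long)pthread_self(), *(int *)arg);
        pthread_exit(NULL);            /* equivalent to returning NULL here */
    }

    int main(void)
    {
        pthread_t thread_id;
        int value = 42;

        if (pthread_create(&thread_id, NULL, myroutine, &value) != 0) {
            perror("pthread_create");
            return 1;
        }
        pthread_join(thread_id, NULL); /* wait for the thread to exit */
        return 0;
    }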

The POSIX Threads (Pthreads) Model

• Pthreads supports only thread parallelism, not fine-grain data parallelism.
• Pthreads is portable and supported among major UNIX SMP platforms.
• Pthreads is not scalable.
• Pthreads is low level, because it uses the library approach rather than compiler directives.
• Pthreads does not even have a Fortran binding.

Designing Threaded Programs: Performance

Threads can represent negligible to significant overhead, depending on how they are implemented and how they are used:
• The CPU cycles spent on synchronization calls that enforce orderly access to shared data. These calls cost CPU cycles to execute.
• The time during which the application is inactive while one thread waits on another thread. This cost results from too many dependencies among threads.
• The memory and CPU cycles required to manage each thread, including the structures the operating system uses to manage them, plus the overhead of the Pthreads library.

Pthreads: Performance Issues

• A multi-threaded program can outperform a similar non-threaded application (not always; it is application dependent).
• Bad design decisions: trying to force concurrency onto a large set of strictly ordered tasks is a very basic bad design.
• The cost of sharing too much - locking:
  - Locks reveal the dependencies among the threads.
  - The few calls required to lock and unlock an un-owned lock are minimal overhead. This has little impact on a program's concurrency, so it is usually acceptable.
  - Concurrency may give a multithreaded program its greatest performance advantage over other styles of programming, but the more the threads share data, the more that performance is pulled back.

Pthreads: Performance Issues - Thread Overhead

• There is also the time a thread spends waiting for a lock that is already held by another thread. Because it keeps the thread from accomplishing its task, this delay may cause a significant loss of concurrency.
• The loss can become magnified if other threads depend on the results of the blocked thread.
• Rule: Ensure that, when your threads do hold locks, they hold them for the shortest possible time.

Pthreads: Performance Issues - The Cost of Sharing Too Much Locking

• Reduce lock contention: avoid poor lock placement; focus on reducing the amount of data protected by any one lock; consider replacing a mutex with a condition variable where threads wait on events rather than share data.
• Threads should hold locks for the shortest time possible.
• Rule of thumb: use mutex locks to synchronize access to shared data; use condition variables to synchronize threads against events, that is, those places in your program where one thread needs to wait for another to do something before proceeding.

Pthreads: Performance Issues - Thread Overhead

When a thread is created, the Pthreads library (and perhaps the system) must perform the following:
• Search databases and allocate new data structures, synchronizing the creation of this thread with other pthread_create calls that may be in progress at the same time.
• Place the newly created thread into the system's scheduling queues. In a kernel thread-based implementation, this requires a system call.
• The OS allocates resources for the thread that are similar to those it allocates for a process.

Pthreads: Performance Issues - Thread Overhead

• Minimize this overhead by avoiding the simplistic one-thread-per-task model.
• Re-using existing threads is an excellent way to avoid the overhead of thread creation.
• Experimentation is needed to determine how many threads can run efficiently at the same time.
• At length, you should create the maximum number of threads at initialization time, so that a thread's creation expense is not billed against the request the thread is meant to process.

Pthreads: Performance Issues - Thread Context Switches

• Once threads are created, they must share often-limited CPU resources. Even on a multiprocessor platform, the number of threads in your program may easily exceed the number of available CPUs.
• Scheduling a new thread requires a context switch between threads: the running thread is interrupted, and its registers and other private resources are saved.
• Context switches are voluntary or involuntary. Reducing the number of involuntary switches is a good way to avoid the overhead of unnecessary context switches and improve your program's performance.
• Too many context switches may simply mean that your program has too many threads.

Pthreads: Performance Issues - Synchronization Overhead

• Each synchronization object (be it a mutex, condition variable, once block, or key) requires that the Pthreads library create and maintain some data structures and execute some code (possibly a system call).
• Creating a large number of such objects has its own cost, and the cost can be magnified by the way in which you deploy the synchronization objects.
• Example: If you create a lock for each record in a database, you increase the disk space required to store the database as well as the memory required to hold it while a thread is running.

Pthreads: Performance Issues - Synchronization Overhead

How do your threads spend their time? Profiling a program is a good step toward identifying its performance bottlenecks (CPU utilization, waiting for locks, and I/O completion):
• Do the threads spend most of their time blocked, waiting for other threads to release locks?
• Are they runnable for most of their time, but not actually running because other threads are monopolizing the available CPUs?
• Are they spending most of their time waiting on the completion of I/O requests?

Pthreads: Performance Issues - Synchronization Overhead

• Performance depends on the input workload:
  - Number of clients vs. ratio of time to completion
  - Increasing clients and contention
• Performance depends on the type of work threads do:
  - Percentage of thread I/O vs. CPU, and ratio of time to completion
• Performance depends on a good locking strategy:
  - No locks at all; one lock for the entire database; one lock for each account in the database

Conclusions

• Important issues in shared memory parallel programming
• Shared memory programs are written in a platform-specific language for SMPs
• Threads are usually the preferred way to parallelize codes on an SMP
• Common synchronization problems with Pthreads
• Pthreads performance issues

Shared Memory Programming

• The shared memory model has not progressed as much as message-passing parallel programming:
  - Lack of a widely accepted standard such as MPI or PVM for message passing (the scientific computing community is accepting OpenMP).
  - Shared memory programs are written in a platform-specific language for SMPs and PVPs.
  - Such programs are not portable even among multiprocessors, not to mention multicomputers (MPPs and clusters).