Parallel Computing on a Beowulf Cluster
Supervisor: Dr. Janet Hartman    Student: Gaurav Saxena

General terms used in Parallel Computing:

Parallel Computing: Simultaneous execution of a single computational problem on multiple processors in order to obtain results faster.
Cluster: A collection of computers on a network that can function as a single computing resource through the use of additional system-management software.
Process: An executing instance of a program.
Processor: A physical, tangible 'thing' (a CPU chip) in your computer that performs most of the work.
Shared Memory Architecture: All processes have the same access to a logical memory location irrespective of where the physical memory is present.
Distributed Memory Architecture: Each process has access only to its 'local' memory and needs communication to access the memory of another process.
Synchronization: In parallel computing, bringing the processes to the same point in execution before any can continue.
Embarrassingly Parallel Problem: A computing problem which can easily be divided into parts such that each part running in parallel is independent of every other part.
Beowulf Cluster: A multi-computer architecture built from COTS components such as PCs and standard Ethernet adapters and switches. Uses a commodity operating system like Linux and standard libraries like MPI and PVM.
Scalability: A proportionate increase in parallel speedup from the addition of more processors.

Why need Parallel Computing?
- Time is important: if the weather prediction for the next day takes two days to compute, the forecast is useless
- Computationally complex and time-consuming problems include: DNA modeling, CFD simulations, data mining, 3-D graphics rendering

Why need a Beowulf Cluster?
- The cost of early supercomputers was very high (the BBN Butterfly and Cray-1 cost more than $100,000)
- Beowulf clusters are very cheap
- They use free commodity software like Linux
- The communication network is slow, but computation speed is fast
- Can give better performance per dollar than traditional parallel computers

Parallel Programming Architecture
Shared medium:
- Allows only one message to be sent at a time
- Each processor listens to every message transmitted and receives only the ones addressed to it
- If processes send messages simultaneously, the messages are garbled
- Garbled messages are re-sent after a random period of time

Parallel Programming Architecture
Switched media:
- Simultaneous transmission of multiple messages among different pairs of processors can take place
- Every processor has its own communication path to the switch

Switch Network Topologies
Assumption: switches connect processors or other switches.
- Direct topology: the ratio of switch nodes to processor nodes is 1:1; every processor is connected to one switch
- Indirect topology: the ratio of switch nodes to processor nodes is greater than 1:1
Evaluation criteria for a switch network: diameter of the network, bisection width, edges per switch node, constant edge length.

Types of Switch Network
- 2-D Mesh
- Binary tree
- Hypertree
- Butterfly
- Hypercube
- Shuffle-exchange

(Diagrams on original slide: Butterfly network, Shuffle-Exchange network)

Classification Scheme for Parallel Computers
(Describes the architecture of a CPU. Network architecture deals with the arrangement of computers; the two are different.)
- SISD: uniprocessors
- SIMD: processor arrays, pipelined vector processors
- MISD: systolic arrays
- MIMD: multiprocessors, multicomputers

Brief Overview of Computer Architecture
SISD: single instruction stream, single data stream
- A single CPU executing a single instruction stream
- Uniprocessors have SISD architectures

SIMD
Single instruction stream, multiple data streams
- Examples: processor array and pipelined vector processor
- A single control unit processes one instruction stream
- The same operation is executed concurrently on multiple data elements
- Note: a difference exists between 'processing' and 'executing'

MISD
Multiple instruction streams, single data stream
- Example: systolic array
- An MISD computer is like a pipeline: multiple independently executing functional units operate on a single stream of data, forwarding results from one functional unit to the next

MIMD
Multiple instruction streams, multiple data streams
- Examples: multiprocessors and multicomputers
- Multiple CPUs can simultaneously execute multiple instruction streams affecting different data streams

Parallel Programming
Requires knowledge of:
- Network topology
- Data structures
- Algorithms
- Patterns of communication

Parallel Programming
Functional view of a parallel processor. Analogy: riding a bike requires knowledge of the brakes, wheels, head and tail lights, and pedals.

Initial Problems with Parallel Programming
- A sequential language like C or FORTRAN was used with a message-passing library for processes to communicate
- Each vendor had its own set of function calls
- Programs compiled on one machine may not work on another machine
- In 1992 a standardization effort for message-passing libraries began; the MPI-1 standard was released in 1994 and MPI-2 in 1997
- PVM is another message-passing library; OpenMP is a shared-memory programming standard rather than a message-passing one

Introduction to MPI
- Message Passing Interface is a standard specification for message-passing libraries
- Programs developed with MPI can be reused on any other, faster parallel computer
- Not a programming language; it has bindings for C, C++, and FORTRAN, and is usable from tools such as MATLAB
- Knowledge of MPI doesn't imply knowledge of parallel programming

Introduction to MPI
- Processes can't overwrite another process's local memory
- Programs run deterministically
- Allows designers to create programs without worrying about remote memory access
- Programs run faster since only local memory access is needed

Introduction to MPI
- Has over 100 functions
- Classification of communication: point-to-point communication and collective communication
- (Not a programming language; used in conjunction with C, C++, Java, MATLAB, FORTRAN)

MPI Programming with C
A simple hello world program (the original slide dropped the header names, variable declarations, and output statement; they are restored here):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank;   /* rank of this process   */
    int p;         /* number of processes    */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    printf("Hello world from process %d of %d\n", my_rank, p);

    MPI_Finalize();
    return 0;
} /* main */

Analysis and Performance Measure
Absolute performance = elapsed time (t)
- Measured with the time command in Unix
- Or with the MPI functions MPI_Wtime(void) and MPI_Wtick(void)

Analysis and Performance Measure
Relative performance:
- Speedup = P1 / Pn
- Efficiency = Speedup / n
where n denotes the number of processors, P1 the time taken to run on 1 processor, and Pn the time taken to run on n processors.

Analysis and Performance Issues – Amdahl's Law
- A problem contains m operations, run on p processors
- A fraction q of the operations executes in parallel; the fraction 1 - q executes serially
- Speedup = 1 / ((1 - q) + q / p)
- If q = 1, all operations are done in parallel
- As the number of processors grows, the speedup approaches the asymptotic limit 1 / (1 - q)

Gustafson-Barsis Law
- Given by John Gustafson and Ed Barsis
- Execution time is held constant and the problem size increases with the number of processors
- States that the q factor in Amdahl's law is dependent on the number of processors
- Begins with the parallel execution time and estimates the sequential execution time from it

Key factors in Performance
- Load balancing
- Scalability
- Communication frequency
- Percentage of parallelization in a program
- I/O requirements

Problems Analyzed and Solved:
- Circuit satisfiability problem
- Analysis of the n-body problem in physics
- Computing the value of π using Simpson's rule
- Sieve of Eratosthenes
- Floyd's algorithm
- Matrix-vector multiplication
- Document classification
- Gauss-Seidel algorithm
- Mandelbrot set
- Gaussian elimination
- Analysis of other iterative methods like the Jacobi method

Current Work:
Sobel's edge detection method:
- A gradient-based method used for detecting edges
- Edges are considered to be pixels with a high gradient
- Uses a pair of 3x3 convolution masks: one determining the gradient in the x-direction, the other the gradient in the y-direction

Actual Convolution Masks
x-direction (Gx):    y-direction (Gy):
  -1  0  +1            +1  +2  +1
  -2  0  +2             0   0   0
  -1  0  +1            -1  -2  -1

Design Considerations:
Why parallelism? Suppose an image has 1024 x 1024 pixels and each pixel is 8 bits:
- The storage requirement is 2^20 bytes (1 MB)
- If each pixel needs to be operated on once, 2^20 operations are required
- The root node is fast for sequential computation, but the calculation still takes about 10 ms

Design Considerations:
Sample image block (pixel values):

121 122 123 125 121
111 112 113 115 114
117 112 145 114 117
110 110 110 111 121
132 123 111 123 123

The centre pixel (145, shown in bold on the original slide) is the pixel whose edge value is being computed; the highlighted 3x3 block around it is where the convolution masks are applied.

Design Considerations:
- The image is in PGM format, a simple format for images; other available formats include GIF and JPEG
- Since it's a matrix computation, the problem can be decomposed by columns or by rows
- Column decomposition is not useful since each column value would need to be sent individually (communication overhead)
- Row-major decomposition is therefore used; it needs less pointer arithmetic when reading the arrays
- No load balancing is required in this problem

Design Considerations:
- Each processor operates on its own part of the image
- However, a synchronization barrier is needed so that the processors finish their jobs at the same point
- After the processors have finished the transfer, the data is available in their local memory
- This is the only communication step required
- Each processor's data is written to the file before the transfer

Sobel's Edge Detection Images: (Before and After)

Where to go:
- The various algorithms learned and analyzed helped in solving Sobel's edge detection method
- Sobel's edge detection method is useful for Dr. Raghavan's domain decomposition problem
- Requires further analysis

References
http://vlsi1.engr.utk.edu/~bdalton5/report/finalReport.htm#_Toc71392006
http://wwwnavab.in.tum.de/twiki/pub/Chair/TeachingSs04SeminarImaging/05EdgeDetectionHandout.pdf
http://imaging.utk.edu/publications/papers/1996/salinas_ie96.html
http://scholar.lib.vt.edu/theses/available/etd-05132004-140722/unrestricted/Jignesh-Shah-Thesis-Revised-2.pdf
http://alamos.math.arizona.edu/~rychlik/iterator.html
http://www-128.ibm.com/developerworks/linux/library/l-beow.html
http://www-unix.mcs.anl.gov/mpi/implementations.html
http://www.linuxgazette.com/issue65/joshi.html
http://www.hpcquest.ihpc.a-star.edu.sg/files/HPCQuest%202004%20%5DIntroduction%20to%20Parallel%20Computing%20&%20MPI.pdf
http://www.faqs.org/docs/Linux-HOWTO/Beowulf-HOWTO.html#ss2.2