
A Parallel Gauss-Seidel Algorithm for Sparse Power Systems Matrices

D. P. Koester, S. Ranka, and G. C. Fox

School of Computer and Information Science and

The Northeast Parallel Architectures Center (NPAC)

Syracuse University

Syracuse, NY 13244-4100

[email protected], [email protected], [email protected]

A Condensed Version of this Paper was presented at Supercomputing '94

NPAC Technical Report SCCS 630, 4 April 1994

Abstract

We describe the implementation and performance of an efficient parallel Gauss-Seidel algorithm that has been developed for irregular, sparse matrices from electrical power systems applications. Although Gauss-Seidel algorithms are inherently sequential, by performing specialized orderings on sparse matrices, it is possible to eliminate much of the data dependencies caused by precedence in the calculations. A two-part ordering technique has been developed: first to partition the matrix into block-diagonal-bordered form using diakoptic techniques, and then to multi-color the data in the last diagonal block using graph coloring techniques. The ordered matrices often have extensive parallelism, while maintaining the strict precedence relationships in the Gauss-Seidel algorithm. We present timing results for a parallel Gauss-Seidel solver implemented on the Thinking Machines CM-5 distributed memory multi-processor. The algorithm presented here requires active message remote procedure calls in order to minimize communications overhead and obtain good relative speedup. The paradigm used with active messages greatly simplified the implementation of this sparse matrix algorithm.

1 Introduction

We have developed an efficient parallel Gauss-Seidel algorithm for irregular, sparse matrices from electrical power systems applications. Even though Gauss-Seidel algorithms for dense matrices are inherently sequential, it is possible to identify sparse matrix partitions without data dependencies so calculations can proceed in parallel while maintaining the strict precedence rules in the Gauss-Seidel technique. All data parallelism in our Gauss-Seidel algorithm is derived from within the actual interconnection relationships between elements in the matrix. We employed two distinct ordering techniques in a preprocessing phase to identify the available parallelism within the matrix structure:

1. partitioning the matrix into block-diagonal-bordered form,

2. multi-coloring the last diagonal block.

Our challenge has been to identify available parallelism in the irregular sparse power systems matrices and develop an efficient parallel Gauss-Seidel algorithm to exploit that parallelism.

Power system distribution networks are generally hierarchical, with limited numbers of high-voltage lines transmitting electricity to connected local networks that eventually distribute power to customers. In order to ensure reliability, highly interconnected local networks are fed electricity from multiple high-voltage sources. Electrical power grids have graph representations which in turn can be expressed as matrices: electrical buses are graph nodes and matrix diagonal elements, while electrical transmission lines are graph edges which can be represented as non-zero off-diagonal matrix elements.

We show that it is possible to identify the hierarchical structure within a power system matrix using only the knowledge of the interconnection pattern, by tearing the matrix into partitions and coupling equations that yield a block-diagonal-bordered matrix. Node-tearing-based partitioning identifies the basic network structure that provides parallelism for the majority of calculations within a Gauss-Seidel iteration. Meanwhile, without additional ordering, the last diagonal block would be purely sequential, limiting the potential speedup of the algorithm in accordance with Amdahl's law. The last diagonal block represents the interconnection structure within the equations that couple the partitions found in the previous step. Graph multi-coloring has been used to order this matrix partition and subsequently identify those rows that can be solved in parallel.

We implemented explicit load balancing as part of each of the aforementioned ordering steps to maximize efficiency as the parallel algorithm is applied to real power system load-flow matrices. An attempt was made to place equal amounts of processing in each partition, and in each matrix color. The metric employed when load-balancing the partitions is the number of floating point multiply/add operations, not simply the number of rows per partition. Empirical performance data collected on the parallel Gauss-Seidel algorithm illustrate the ability to balance the workload for as many as 32 processors.

We implemented the parallel Gauss-Seidel algorithm on the Thinking Machines CM-5 distributed memory multi-processor using the Connection Machine active message layer (CMAML). Using this communications paradigm, significant improvements in the performance of the algorithm were observed compared to more traditional communications paradigms that use the standard blocking send and receive functions in conjunction with packing data into communications buffers. To significantly reduce communications overhead and attempt to hide communications behind calculations, we implemented each portion of the algorithm using CMAML remote procedure calls. The communications paradigm we use throughout this algorithm is to send a double precision data value to the destination processor as soon as the value is calculated. The use of active messages greatly simplified the development and implementation of this parallel sparse Gauss-Seidel algorithm.

Parallel implementations of Gauss-Seidel have generally been developed for regular problems such as the solution of Laplace's equations by finite differences [3, 4], where red-black coloring schemes are used to provide independence in the calculations and some parallelism. This scheme has been extended to multi-coloring for additional parallelism in more complicated regular problems [4]; however, we are interested in the solution of irregular linear systems. There has been some research into applying parallel Gauss-Seidel to circuit simulation problems [12], although this work showed poor parallel speedup potential in a theoretical study. Reference [12] also extended traditional Gauss-Seidel and Gauss-Jacobi methods to waveform relaxation methods that trade overhead and convergence rate for parallelism. A theoretical discussion of parallel Gauss-Seidel methods for power system load-flow problems on an alternating sequential/parallel (ASP) multi-processor is presented in [15]. Other research with parallel Gauss-Seidel methods for power systems applications is presented in [7], although our research differs substantially from that work. The research we present here utilizes a different matrix ordering paradigm, a different load balancing paradigm, and a different parallel implementation paradigm than that presented in [7]. Our work utilizes diakoptic-based matrix partitioning techniques developed initially for a parallel block-diagonal-bordered direct sparse linear solver [9, 10]. In reference [9] we examined load balancing issues associated with partitioning power systems matrices for parallel Choleski factorization.

The paper is organized as follows. In section 2, we introduce the electrical power system applications that are the motivation for this work. In section 3, we briefly review the Gauss-Seidel iterative method; in section 4, we then present a theoretical derivation of the available parallelism with Gauss-Seidel for a block-diagonal-bordered form sparse matrix. Paramount to exploiting the advantages of this parallel linear solver is the preprocessing phase that orders the irregular sparse power system matrices and performs load-balancing. We discuss the overall preprocessing phase in section 5, and describe node-tearing-based ordering and graph multi-coloring-based ordering in sections 6 and 7 respectively. We describe our parallel Gauss-Seidel algorithm in section 8, and include a discussion of the hierarchical data structures to store the sparse matrices. Analysis of the performance of these ordering techniques for actual power system load flow matrices from the Boeing-Harwell series and for a matrix distributed with the Electrical Power Research Institute (EPRI) ETMSP software is presented in section 9. Examinations of the convergence of the algorithm are presented along with parallel algorithm performance. We state our conclusions in section 10.

2 Power System Applications

The underlying motivation for our research is to improve the performance of electrical power system applications to provide real-time power system control and real-time support for proactive decision making. Our research has focused on matrices from load-flow applications [15]. Load-flow analysis examines steady-state equations based on the positive definite network admittance matrix that represents the power system distribution network, and is used for identifying potential network problems in contingency analyses, for examining steady-state operations in network planning and optimization, and for determining initial system state in transient stability calculations [15]. Load flow analysis entails the solution of non-linear systems of simultaneous equations, which are performed by repeatedly solving sparse linear equations. Sparse linear solvers account for the majority of floating point operations encountered in load-flow analysis. Load flow is calculated using network admittance matrices, which are symmetric positive definite and have sparsity defined by the power system distribution network. Individual power utility companies often examine networks in their operations centers that are represented by less than 2,000 sparse complex equations, while regional power authority operations centers would examine load-flow with matrices that have as many as 10,000 sparse complex equations. This paper presents data for power system networks of 1,723, 4,180, and 5,300 nodes.

3 The Gauss-Seidel Method

We are considering an iterative solution to the linear system

Ax = b,   (1)

where A is an (n × n) sparse matrix, x and b are vectors of length n, and we are solving for x. Iterative solvers are an alternative to direct methods that attempt to calculate an exact solution to the system of equations. Iterative methods attempt to find a solution to the system of linear equations by repeatedly solving the linear system using approximations to the x vector. Iterations continue until the solution is within a predetermined acceptable bound on the error.

Common iterative methods for general matrices include the Gauss-Jacobi and Gauss-Seidel, while conjugate gradient methods exist for positive definite matrices. Critical in the choice and use of iterative methods is the convergence of the technique. Gauss-Jacobi uses all values from the previous iteration, while Gauss-Seidel requires that the most recent values be used in calculations. The Gauss-Seidel method generally has better convergence than the Gauss-Jacobi method, although for dense matrices, the Gauss-Seidel method is inherently sequential. Better convergence means fewer iterations, and a faster overall algorithm, as long as the strict precedence rules can be observed. The convergence of the iterative method must be examined for the application along with algorithm performance to ensure that a useful solution to Ax = b can be found.

The Gauss-Seidel method can be written as:

x_i^{(k+1)} = \frac{1}{a_{ii}} \Big( b_i - \sum_{j<i} a_{ij} x_j^{(k+1)} - \sum_{j>i} a_{ij} x_j^{(k)} \Big),   (2)

where:

x_i^{(k)} is the i-th unknown in x during the k-th iteration, i = 1, ..., n and k = 0, 1, ...,

x_i^{(0)} is the initial guess for the i-th unknown in x,

a_{ij} is the coefficient of A in the i-th row and j-th column,

δ ← ∞
while δ > ε_converge
    for k = 1 to n_iter
        for i = 1 to n
            x̃_i ← x_i
            x_i ← b_i
            for each j ∈ [1, n], j ≠ i, such that a_{ij} ≠ 0
                x_i ← x_i − (a_{ij} × x_j)
            endfor
            x_i ← x_i / a_{ii}
        endfor
    endfor
    δ ← 0
    for i = 1 to n
        δ ← δ + abs(x̃_i − x_i)
    endfor
endwhile

Figure 1: Sparse Gauss-Seidel Algorithm

b_i is the i-th value in b;

or

x^{(k+1)} = (D + L)^{-1} \big[ b - U x^{(k)} \big],   (3)

where:

x^{(k)} is the k-th iterative solution to x, k = 0, 1, ...,

x^{(0)} is the initial guess at x,

D is the diagonal of A,

L is the strictly lower triangular portion of A,

U is the strictly upper triangular portion of A,

b is the right-hand-side vector.

The representation in equation 2 is used in the development of the parallel algorithm, while the equivalent matrix-based representation in equation 3 is used below in discussions of available parallelism.

We present a general sequential sparse Gauss-Seidel algorithm in figure 1. This algorithm calculates a constant number of iterations before checking for convergence. For very sparse matrices, such as power systems matrices, the computational complexity of the section of the algorithm which checks convergence is O(n), nearly the same as that of a new iteration of x^{(k+1)}. Consequently, we perform multiple iterations between convergence checks. Only non-zero values in A are used when calculating x^{(k+1)}.
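To make the algorithm in figure 1 concrete, the following C sketch shows one possible realization over a matrix stored in compressed sparse row (CSR) form; the CSR layout, the field names, and the helper interface are our illustrative assumptions, not the CM-5 implementation described in section 8.

/* A minimal sketch of the sequential sparse Gauss-Seidel sweeps of
 * figure 1, assuming a CSR matrix layout; illustrative only. */
#include <math.h>

typedef struct {
    int     n;        /* matrix dimension                        */
    int    *rowptr;   /* rowptr[i]..rowptr[i+1]-1 index row i    */
    int    *colind;   /* column index j of each non-zero a_ij    */
    double *val;      /* the non-zero values a_ij                */
    double *diag;     /* the diagonal values a_ii                */
} csr_matrix;

/* Perform n_iter sweeps, then return delta = sum_i |x~_i - x_i|,
 * the total error against the values saved before the last sweep. */
double gauss_seidel_sweeps(const csr_matrix *A, const double *b,
                           double *x, double *x_save, int n_iter)
{
    for (int k = 0; k < n_iter; k++) {
        for (int i = 0; i < A->n; i++) {
            x_save[i] = x[i];                 /* x~_i for the error check */
            double xi = b[i];
            for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; p++)
                if (A->colind[p] != i)        /* only off-diagonal terms  */
                    xi -= A->val[p] * x[A->colind[p]];  /* newest x_j     */
            x[i] = xi / A->diag[i];
        }
    }
    double delta = 0.0;
    for (int i = 0; i < A->n; i++)
        delta += fabs(x_save[i] - x[i]);
    return delta;
}

Because the inner loop reads x_j directly from the vector being updated, rows with j < i automatically contribute their new x_j^{(k+1)} values, which is exactly the precedence rule that distinguishes Gauss-Seidel from Gauss-Jacobi.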

It is very difficult to determine if one-step iterative methods, like the Gauss-Seidel method, converge for general matrices. Nevertheless, for some classes of matrices, it is possible to prove that Gauss-Seidel methods do converge and yield the unique solution x for Ax = b with any initial starting vector x^{(0)}. Reference [4] proves theorems to show that this holds for both diagonally dominant and symmetric positive definite matrices. The proofs of these theorems state that the Gauss-Seidel method will converge for these matrix types; however, there is no evidence as to the rate of convergence.

Symmetric sparse matrices can be represented by graphs with elements in equations corresponding to undirected edges in the graph [6]. Ordering a symmetric sparse matrix is actually little more than changing the labels associated with nodes in an undirected graph. Modifying the ordering of a sparse matrix is simple to perform using a permutation matrix P of either zeros or ones that simply generates elementary row and column exchanges. Applying the permutation matrix P to the original linear system in equation 1 yields the linear system

(P A P^T)(P x) = (P b),   (4)

that is solved using the parallel Gauss-Seidel algorithm. While ordering the matrix greatly simplifies accessing parallelism inherent within the matrix structure, ordering can have an effect on convergence [4]. In section 9, we present empirical data to show that in spite of the ordering to yield parallelism, convergence appears to be rapid for positive definite power systems load-flow matrices.
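In an implementation, the permutation matrix P of equation 4 need not be stored explicitly; an index vector is sufficient. A minimal sketch, assuming the convention perm[new] = old:

/* Applying the permutation of equation 4 with an index vector rather
 * than an explicit matrix of zeros and ones; perm[new] = old is an
 * assumed convention, so (Px)[i] = x[perm[i]]. */
void permute_vector(const int *perm, const double *x, double *Px, int n)
{
    for (int i = 0; i < n; i++)
        Px[i] = x[perm[i]];
}

Permuting the sparse structure itself is analogous: the rows are emitted in the new order and each stored column index is relabeled through the inverse permutation, so the symmetric reordering P A P^T never materializes a dense matrix.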

4 Available Parallelism

While Gauss-Seidel algorithms for dense matrices are inherently sequential, it is possible to identify portions of sparse matrices that do not have mutual data dependencies, so calculations can proceed in parallel on mutually independent matrix partitions while maintaining the strict precedence rules in the Gauss-Seidel technique. All parallelism in the Gauss-Seidel algorithm is derived from within the actual interconnection relationships between elements in the matrix. Ordering sparse matrices into block-diagonal-bordered form can offer substantial opportunity for parallelism, because the values of x^{(k+1)} in entire sparse matrix partitions can be calculated in parallel without requiring communications. Because the sparse matrix is a single system of equations, all equations (with off-diagonal variables) are dependent. Dependencies within the linear system require data movement from mutually independent partitions to those equations that couple the linear system. After we develop the Gauss-Seidel algorithm for a block-diagonal-bordered matrix, the optimum data/processor assignments for an efficient parallel implementation are straightforward.

While much of the parallelism in this algorithm comes from the block-diagonal-bordered ordering of the sparse matrix, further ordering of the last diagonal block is required to provide parallelism in what would otherwise be a purely sequential portion of the algorithm. The last diagonal block represents the interconnection structure within the equations that couple the partitions in the block-diagonal portion of the matrix. These equations are rather sparse, often with substantially fewer off-diagonal matrix elements (graph edges) than diagonal matrix elements (graph nodes). Consequently, it is rather simple to color the graph representing this portion of the matrix. Separate graph colors represent rows where x^{(k+1)} can be calculated in parallel, because within a color, no two nodes have any adjacent edges. For the parallel Gauss-Seidel algorithm, a synchronization barrier is required between colors to ensure that all new x^{(k+1)} values are distributed to the processors so that the strict precedence relations in the calculations are maintained.

4.1 Parallelism in Block-Diagonal-Bordered Matrices

To clearly identify the available parallelism in the block-diagonal-bordered Gauss-Seidel method, we define a block diagonal partition on the matrix, apply that partition to formula 3, and equate terms to identify available parallelism. We must also define a sub-partitioning of the last diagonal block to identify parallelism after multi-coloring.

First, we define a partitioning of the system of linear equations (P A P^T)(P x) = (P b), where the permutation matrix P orders the matrix into block-diagonal-bordered form:

\begin{pmatrix}
A_{1,1} & & 0 & A_{1,m+1} \\
& \ddots & & \vdots \\
0 & & A_{m,m} & A_{m,m+1} \\
A_{m+1,1} & \cdots & A_{m+1,m} & A_{m+1,m+1}
\end{pmatrix}
\begin{pmatrix} x_1^{(k)} \\ \vdots \\ x_m^{(k)} \\ x_{m+1}^{(k)} \end{pmatrix}
=
\begin{pmatrix} b_1 \\ \vdots \\ b_m \\ b_{m+1} \end{pmatrix}   (5)

Equation 3 divides the P A P^T matrix into a diagonal component D, a strictly lower diagonal portion of the matrix L, and a strictly upper diagonal portion of the matrix U such that:

P A P^T = D + L + U.   (6)

Derivation of the block-diagonal-bordered form of the D, L, and U matrices is straightforward. Equation 3 requires the calculation of (D + L)^{-1}, which also is simple to determine explicitly, because this matrix has block-diagonal-lower-bordered form. Given these partitioned matrices, it is relatively straightforward to identify available parallelism by substituting the partitioned matrices and the partitioned x^{(k)} and b vectors into the definition of the Gauss-Seidel method and then performing the matrix multiplications. As a result we obtain:

x^{(k+1)} =
\begin{pmatrix}
(D_{1,1} + L_{1,1})^{-1} \big[ b_1 - U_{1,1} x_1^{(k)} - U_{1,m+1} x_{m+1}^{(k)} \big] \\
\vdots \\
(D_{m,m} + L_{m,m})^{-1} \big[ b_m - U_{m,m} x_m^{(k)} - U_{m,m+1} x_{m+1}^{(k)} \big] \\
(D_{m+1,m+1} + L_{m+1,m+1})^{-1} \big[ b_{m+1} - \sum_{i=1}^{m} \big( L_{m+1,i} x_i^{(k+1)} \big) - U_{m+1,m+1} x_{m+1}^{(k)} \big]
\end{pmatrix}   (7)

We can identify the parallelism in the block-diagonal-bordered portion of the matrix by examining equation 7. If we assign each partition i, (i = 1, ..., m), to a separate processor, the calculations of x_i^{(k+1)} are independent and require no communications. Note that the vector x_{m+1}^{(k)} is required for the calculations in each partition, and there is no violation of the strict precedence rules in the Gauss-Seidel because it is calculated in the last step. After calculating x_i^{(k+1)} in the first m partitions, the values of x_{m+1}^{(k+1)} must be calculated using the lower border and last block. From the previous step, the values of x_i^{(k+1)} would be available on the processors where they were calculated, so the values of (L_{m+1,i} x_i^{(k+1)}) can be readily calculated in parallel. Only (matrix × vector) products, calculated in parallel, are involved in the communications phase. Furthermore, if we assign

\hat{b}_{m+1} = b_{m+1} - \sum_{i=1}^{m} \big( L_{m+1,i} x_i^{(k+1)} \big),   (8)

then the formulation of x_{m+1}^{(k+1)} looks similar to equation 3:

x_{m+1}^{(k+1)} = (D_{m+1,m+1} + L_{m+1,m+1})^{-1} \big[ \hat{b}_{m+1} - U_{m+1,m+1} x_{m+1}^{(k)} \big].   (9)
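For illustration, the local contribution of one processor to equation 8 can be sketched as a loop over the lower-border rows it owns; the names and the CSR-style layout below are our assumptions, and the partial sums must still be combined across processors.

/* Sketch of equation 8 from one processor's point of view: accumulate
 * the (matrix x vector) products of locally owned border rows into a
 * local partial copy of b^_{m+1}; illustrative names throughout. */
void accumulate_border_products(int n_border_rows,
                                const int *row,     /* global row ids     */
                                const int *rowptr,  /* CSR row pointers   */
                                const int *colind, const double *val,
                                const double *x,    /* new x^(k+1) values */
                                double *bhat)       /* partial b^_{m+1}   */
{
    for (int r = 0; r < n_border_rows; r++) {
        double theta = 0.0;
        for (int p = rowptr[r]; p < rowptr[r + 1]; p++)
            theta += val[p] * x[colind[p]];
        bhat[row[r]] -= theta;   /* subtract the L_{m+1,i} x_i^(k+1) terms */
    }
}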

4.2 Parallelism in Multi-Colored Matrices

The ordering imposed by the permutation matrix P includes multi-coloring-based ordering of the last diagonal block that produces sub-partitions with parallelism. We define the sub-partitioning as:

A_{m+1,m+1} =
\begin{pmatrix}
\hat{D}_{1,1} & \hat{A}_{1,2} & \cdots & \hat{A}_{1,c} \\
\hat{A}_{2,1} & \hat{D}_{2,2} & \cdots & \hat{A}_{2,c} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{A}_{c,1} & \hat{A}_{c,2} & \cdots & \hat{D}_{c,c}
\end{pmatrix}   (10)

where D̂_{i,i} are diagonal blocks and c is the number of colors. After forming L_{m+1,m+1} and U_{m+1,m+1}, it is straightforward to prove that:

\hat{x}^{(k+1)} =
\begin{pmatrix} \hat{x}_1^{(k+1)} \\ \hat{x}_2^{(k+1)} \\ \vdots \\ \hat{x}_c^{(k+1)} \end{pmatrix}
=
\begin{pmatrix}
\hat{D}_{1,1}^{-1} \big[ \hat{b}_1 - \sum_{j>1} \hat{A}_{1,j} \hat{x}_j^{(k)} \big] \\
\hat{D}_{2,2}^{-1} \big[ \hat{b}_2 - \sum_{j<2} \hat{A}_{2,j} \hat{x}_j^{(k+1)} - \sum_{j>2} \hat{A}_{2,j} \hat{x}_j^{(k)} \big] \\
\vdots \\
\hat{D}_{c,c}^{-1} \big[ \hat{b}_c - \sum_{j<c} \hat{A}_{c,j} \hat{x}_j^{(k+1)} \big]
\end{pmatrix}   (11)

Calculating x̂_i^{(k+1)} in each sub-partition of x̂^{(k+1)} does not require values of x̂^{(k+1)} within the sub-partition, so we can calculate the individual values within x̂_i^{(k+1)} in any order and distribute these calculations to separate processors without concern for precedence. In order to maintain the strict precedence in the Gauss-Seidel algorithm, the values of x̂_i^{(k+1)} calculated in each step must be broadcast to all processors, and processing cannot proceed for any processor until it receives the new values of x̂_i^{(k+1)} from all other processors.

If the block-diagonal-bordered matrix partitions A_{i,i}, A_{m+1,i}, and A_{i,m+1} (1 ≤ i ≤ m) are assigned to the same processor, then there are no communications until x_{m+1}^{(k+1)} is calculated. At that time, only (matrix × vector) products are sent to the processors that hold the appropriate data in the last diagonal block. This data-to-processor assignment is defined by multi-coloring only the last diagonal block.

Figure 2 describes the calculation steps in the parallel Gauss-Seidel for a block-diagonal-bordered sparse matrix. This figure depicts four diagonal blocks, and data/processor assignments (P1, P2, P3, and P4) are listed for the data blocks. Figure 3 illustrates the data/processor assignments in the last diagonal block.
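The color-by-color discipline of section 4.2 can be sketched as the loop below, assuming rows of the last block are grouped by color and that exchange_new_values() and barrier() stand in for the machine's communication and synchronization primitives; all names are illustrative.

/* Sketch of the multi-colored last-block update of equation 11: rows
 * within one color have no mutual dependencies, so each processor
 * updates its share of that color, then all new values are exchanged
 * before the next color begins. */
extern void exchange_new_values(int color);   /* assumed primitive */
extern void barrier(void);                    /* assumed primitive */

void update_last_block(int n_colors,
                       const int *color_ptr,  /* rows of color c live at */
                       const int *color_row,  /* color_ptr[c]..[c+1]-1   */
                       const int *rowptr, const int *colind,
                       const double *val, const double *diag,
                       const double *bhat, double *x)
{
    for (int c = 0; c < n_colors; c++) {
        for (int r = color_ptr[c]; r < color_ptr[c + 1]; r++) {
            int i = color_row[r];
            double xi = bhat[i];
            for (int p = rowptr[i]; p < rowptr[i + 1]; p++)
                if (colind[p] != i)
                    xi -= val[p] * x[colind[p]];
            x[i] = xi / diag[i];
        }
        exchange_new_values(c);  /* broadcast this color's new x values */
        barrier();               /* strict precedence between colors    */
    }
}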

5 The Preprocessing Phase

In the previous section, we developed the theoretical foundations of parallel Gauss-Seidel methods with block-diagonal-bordered sparse matrices, and now we will discuss the procedures required

Figure 2: Block-Bordered-Diagonal Form Gauss-Seidel Method ((1) solve for x in the diagonal blocks; (2) calculate the (matrix × vector) products and send; (3) solve for x in the last diagonal block)

Figure 3: Multi-Colored Gauss-Seidel Method for the Last Diagonal Block ((1) solve for x within a color; (2) broadcast the new x values)

to generate the permutation matrices, P, to produce block-diagonal-bordered/multi-colored sparse matrices so that our parallel Gauss-Seidel algorithm is efficient. We must reiterate that all parallelism for our Gauss-Seidel algorithm is identified from the interconnection structure of elements in the sparse matrix during this preprocessing phase. We must order the sparse matrix in such a manner that processor loads are balanced. The technique we have chosen for this preprocessing phase is to:

1. order the matrix into block-diagonal-bordered form while minimizing the size of the last diagonal block,

2. order the last diagonal block using multi-coloring techniques.

Inherent in both preprocessing steps is explicit load-balancing to determine processor/data mappings for efficient implementation of the Gauss-Seidel algorithm.

This preprocessing phase incurs significantly more overhead than solving a single instance of the sparse matrix; consequently, the use of this technique will be limited to problems that have static matrix structures that can reuse the ordered matrix and load balanced processor assignments multiple times in order to amortize the cost of the preprocessing phase over numerous matrix solutions.

5.1 Ordering the Matrix into Block-Bordered-Diagonal Form

We require a technique that orders irregular matrices into block-diagonal-bordered form while limiting the number of coupling equations. Minimizing the number of coupling equations minimizes the size of the last diagonal block in a block-diagonal-bordered sparse matrix, and minimizes the amount of broadcast communications required when calculating values of x^{(k+1)} in the last diagonal block. The effects of minimizing the size of the last diagonal block are not all positive. We have found that minimizing the size of the last block can affect potential parallelism if the resulting workload for calculating x^{(k+1)} in the diagonal blocks cannot be distributed uniformly throughout a multi-processor, in which case there is load imbalance between multi-processors [9]. When determining the optimal ordering for a sparse matrix, the size of the last diagonal block and the subsequent additional communications may be traded for an ordering that yields good load balance in the highly parallel portion of the calculations, especially when using larger numbers of processors.

The method we have chosen to order a sparse matrix into block-diagonal-bordered form is referred to as node-tearing [13], which is a specialized form of diakoptics [5]. We have selected node-tearing nodal analysis because this algorithm determines the natural structure in the matrix while providing the means to minimize the number of coupling equations. With the node-tearing algorithm, we can determine the hierarchical structure in a power system distribution grid solely from the interconnection relationships in the sparse matrices. Tearing here refers to breaking the original problem into smaller sub-problems whose partial solutions can be combined to give the solution of the original problem. Load balancing techniques must be used after the node-tearing matrix ordering step to uniformly distribute the processing load onto a multi-processor.

The node-tearing-based ordering algorithm has the ability to adjust the characteristics of the ordering by varying an input parameter. Empirical data is presented later in section 9 for multiple orderings to illustrate the parallel linear solver algorithm performance as a function of input parameters to the node-tearing algorithm.

Load balancing for node-tearing-based ordering can be performed with a simple pigeon-hole type algorithm that uses a metric based on the number of floating point multiply/add operations in a partition, instead of simply using the number of rows per partition. Load balancing examines the number of operations when calculating x^{(k+1)} in the matrix partitions and the number of operations when calculating the sparse matrix vector products in preparation to solve for x^{(k+1)} in the last diagonal block. These metrics do not consider indexing overhead, which can be rather extensive when working with very sparse matrices stored in an implicit form. This algorithm finds an optimal distribution for workload to processors; however, actual disparity in processor workload is dependent on the irregular sparse matrix structure. This algorithm works best when there are minimal disparities in the workloads for independent blocks or when there are significantly more independent blocks than processors. In this instance, the workloads in multiple small blocks can sum to equal the workload in a single block with more computational workload.
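One way to realize such a pigeon-hole assignment is a greedy pass over the independent blocks, largest operation count first, always placing the next block on the currently least-loaded processor. A sketch with assumed names, where ops[] holds the floating point multiply/add count of each block pre-sorted in descending order:

/* Greedy pigeon-hole load balancing: assign each block (sorted by
 * descending multiply/add count) to the least-loaded processor. */
void assign_blocks(int n_blocks, const long *ops,
                   int n_procs, int *owner, long *load)
{
    for (int p = 0; p < n_procs; p++)
        load[p] = 0;
    for (int blk = 0; blk < n_blocks; blk++) {
        int best = 0;                      /* least-loaded processor */
        for (int p = 1; p < n_procs; p++)
            if (load[p] < load[best])
                best = p;
        owner[blk] = best;
        load[best] += ops[blk];
    }
}

Sorting by workload first lets several small blocks fill in behind one large block, which is the summing behavior described above.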

5.2 Ordering the Last Diagonal Block

The application of diakoptic techniques yields a block-diagonal-bordered matrix form that identifies the basic network structure and provides parallelism for the majority of calculations within a Gauss-Seidel iteration. However, without additional ordering, the last diagonal block would be purely sequential, limiting the potential speedup of the algorithm in accordance with Amdahl's law. The last diagonal block represents the interconnection structure within the equations that couple the partitions found in the previous step. In other words, the variables in the last diagonal block are the interconnections within the equations that tie the entire matrix together. Graph multi-coloring has been used for ordering this portion of the matrix: all nodes of the same color share no interconnections; consequently, the values of x^{(k+1)} in these rows can be calculated in any order without violating the strict precedence rules in the Gauss-Seidel method. As a result, rows within a color can be solved in parallel.

The multi-coloring algorithm we selected for this work is based on the saturation degree ordering algorithm. We also require load balancing, a feature not commonly implemented within graph multi-coloring. As part of our implementation we added a feature that equalizes the number of rows per color to provide some basic load balancing. The graph multi-coloring technique is discussed in greater detail in section 7.

6 Node-Tearing Nodal Analysis

A detailed theoretical derivation of node-tearing is too lengthy to describe here in rigorous mathematical terms. We refer interested readers to references [10, 13] for proofs of the mathematics, although a brief description of node-tearing follows.

Let N denote the nodes of a graph G and let E denote the edges in G, or G = (N, E). In summary, node-tearing is a greedy algorithm that partitions the nodes N in G into

N_1 = \bigcup_{i=1}^{m} N_1^i,   N_2 = N - N_1,   (12)

where:

N_1 is the set of nodes in the mutually independent partitions,

N_1^i is the set of nodes in a mutually independent partition,

N_2 is the set of nodes in the coupling equations.

Mutual independence occurs when no edges in E_1^i are connected to edges in E_1^j, for all i ≠ j and i, j = 1, 2, ..., m. Consequently, G_1^i has no edges in common with G_1^j for all i ≠ j, and there are no edges directly interconnecting any nodes in N_1^i and N_1^j for all i ≠ j. Connectivity between G_1^i and G_1^j, for all i ≠ j, is indirect and must go through nodes in N_2.

In addition to ordering matrices into block-diagonal-bordered form using node-tearing, we require that the number of coupling equations, |N_2|, is minimized over all distinct partitions {N_1, N_2} of G, while also specifying that |N_1^k| ≤ max_DB, k = 1, 2, ..., m. This constraint permits some control of the maximum size of diagonal blocks, max_DB, which can prove quite useful when tearing a graph for solving on multi-processors. By modifying this parameter, control can be exercised over the shape of the ordered sparse matrix, yielding small blocks when max_DB is small and limiting the size of the borders in a block-diagonal-bordered matrix when max_DB is large. This optimization problem belongs to the family of NP-complete problems [13], so a simple, efficient heuristic algorithm has been developed based on examining the contour of the graph [13]. A contour-tableau contains three lists:

1. the iterating sets I_i^k, or the potential elements of a set of nodes in the sub-graph N_1^k,

2. the adjacency sets A_i^k, or the set of nodes adjacent to, but not including, any elements in the corresponding iterating set,

3. the contour numbers c_i^k, or the cardinality of the adjacency set.

As we perform node-tearing, we want to minimize the size of the adjacency set, A_i^k, for each partition, and subsequently this will minimize |N_2|. A separate contour-tableau is developed for each diagonal block.

The software implementation to perform node-tearing nodal analysis utilizes the basic concept of building a contour tableau to identify independent sub-matrices and the coupling equations in an undirected graph representing a sparse matrix. In our implementation, the search for the local minimum of the contour number is limited to within the range (α × max_DB) ≤ i ≤ max_DB, 0 < α < 1. When an independent sub-matrix is found, this iterating set is moved into a set N_1^k, where |N_1^k| = i. Figure 4 illustrates the major steps in the node-tearing ordering algorithm. The algorithm examines all nodes essentially once, where the sizes of the independent sub-blocks are limited to max_DB. The computational complexity of this algorithm is O(max_i |A_i^k| · n), due to the fact that all nodes in the graph must be examined, and for each element in the contour tableau, all elements of the adjacency set must be examined for the next node. The value of max_i |A_i^k| must be less than n, and because the graphs will be sparse, the maximum number in the adjacency set will be substantially less than n.

/* the function Γ(ν) determines the nodes adjacent to ν */
G ← the symmetric graph representing the sparse matrix
while G ≠ ∅ do
    while i ≤ max_DB do
        select ν ∈ A_{i−1}^k such that |Γ(ν)| = min_{η ∈ A_{i−1}^k} |Γ(η)|
        I_i^k ← I_{i−1}^k ∪ {ν}
        A_i^k ← (A_{i−1}^k ∪ Γ(ν)) − {ν}
        if (α × max_DB) ≤ i ≤ max_DB
            determine the location of the local minimum
        endif
    endwhile
    N_1^k ← I^k
    N_2 ← N_2 ∪ A^k
    G ← G − N_1^k − N_2
end while

Figure 4: The Node-Tearing Algorithm
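Read as C, the greedy selection step at the heart of figure 4 might look like the sketch below; keeping the adjacency set as a flat array and precomputing deg[v] = |Γ(v)| are our own simplifying assumptions.

/* Sketch of the selection step in figure 4: from the current adjacency
 * set, pick the node whose own adjacency list is smallest. */
int select_min_contour(const int *adj_set, int adj_size, const int *deg)
{
    int best = adj_set[0];
    for (int s = 1; s < adj_size; s++)
        if (deg[adj_set[s]] < deg[best])
            best = adj_set[s];
    return best;
}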

N̂ ← N_2 (the nodes in the sparse last diagonal block)
while N̂ ≠ ∅ do
    select a node ν from N̂ such that ν has the largest number of neighbors with different colors
    γ(ν) ← the consistent color with the fewest occurrences
    N̂ ← N̂ − {ν}
end while

Figure 5: The Graph Multi-Coloring Algorithm

7 Graph Coloring

Multi-coloring a graph G is an NP-complete problem that attempts to define a minimum number of colors for the nodes of a graph where no adjacent nodes are assigned the same color [8, 11]. A greedy heuristic can yield an optimal ordering if the vertices are visited in the correct order. We selected the saturation degree ordering algorithm [8], but modified it to include load-balancing. The saturation degree ordering algorithm selects a node in the graph that has the largest number of differently colored neighbors. We have added the capability to the saturation degree ordering algorithm to select the color for a node in a manner that equalizes the number of nodes with a particular color. We simply select the consistent color with the fewest number of nodes.

We present our version of the saturation degree ordering-based graph multi-coloring algorithm in figure 5. The computational complexity of this algorithm is O(max_{ν ∈ N̂} |Γ_Ĝ(ν)| · n̂), where Γ_Ĝ(ν) defines the set of nodes in Ĝ adjacent to ν. The graphs encountered for coloring in this work were very sparse, generally with no more than three nodes adjacent to any single node.
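A C sketch of this modified saturation degree heuristic follows; the adjacency-list layout and names are illustrative, and the quadratic scan for the most saturated node is tolerable here only because the last diagonal block is small and very sparse.

/* Sketch of the modified saturation degree ordering of figure 5 over an
 * adjacency-list graph: repeatedly color the node with the most distinct
 * neighbor colors, choosing the consistent color with fewest nodes. */
#define UNCOLORED (-1)

void color_last_block(int n, const int *adjptr, const int *adj,
                      int n_colors, int *color, int *count)
{
    for (int v = 0; v < n; v++)
        color[v] = UNCOLORED;
    for (int c = 0; c < n_colors; c++)
        count[c] = 0;

    for (int done = 0; done < n; done++) {
        /* select the uncolored node with the most distinctly colored neighbors */
        int pick = -1, best_sat = -1;
        for (int v = 0; v < n; v++) {
            if (color[v] != UNCOLORED)
                continue;
            int sat = 0;
            for (int c = 0; c < n_colors; c++)
                for (int p = adjptr[v]; p < adjptr[v + 1]; p++)
                    if (color[adj[p]] == c) { sat++; break; }
            if (sat > best_sat) { best_sat = sat; pick = v; }
        }
        /* among colors no neighbor carries, take the one with fewest nodes */
        int chosen = -1;
        for (int c = 0; c < n_colors; c++) {
            int used = 0;
            for (int p = adjptr[pick]; p < adjptr[pick + 1]; p++)
                if (color[adj[p]] == c) { used = 1; break; }
            if (!used && (chosen < 0 || count[c] < count[chosen]))
                chosen = c;
        }
        color[pick] = chosen;   /* assumes n_colors colors suffice */
        count[chosen]++;
    }
}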

8 Parallel Gauss-Seidel Implementation

We have implemented a parallel version of a block-diagonal-bordered sparse Gauss-Seidel algorithm in the C programming language for the Thinking Machines CM-5 multi-computer using the Connection Machine active message layer (CMAML) remote procedure call as the basis for interprocessor communications [14]. Significant improvements in the performance of the algorithm were observed for active messages, when compared to more traditional communications paradigms that use the standard blocking CMMD send and CMMD receive functions in conjunction with packing data into communications buffers. A significant portion of the communications requires each processor to send short data buffers to every other processor, imposing significant communications overhead due to latency. To significantly reduce communications overhead and attempt to hide communications behind calculations, we implemented each portion of the algorithm using CMAML remote procedure calls (CMAML rpc). The communications paradigm we use throughout this algorithm is to send a double precision data value to the destination processor as soon as the value is calculated. Communications in the algorithm occur at distinct time phases, making polling for the active message handler function efficient. An active message on the CM-5 has a four word payload, which is more than adequate to send a double precision floating point value and an integer position indicator. The use of active messages greatly simplified the development and implementation of this parallel sparse Gauss-Seidel algorithm, because there was no requirement to maintain and pack communications buffers.
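The send-as-soon-as-computed paradigm can be pictured as one function on each side of the transfer. The sketch below assumes a four-argument CMAML_rpc form suggested by the four-word payload just described, and assumes a double precision value can be packed into two payload words; the exact interface should be taken from the CMMD Reference Manual [14], not from this illustration.

/* Illustrative sketch only: the prototype below is assumed for this
 * example; consult the CMMD Reference Manual [14] for the real one. */
#include <string.h>

extern void CMAML_rpc(int node, void (*handler)(),
                      int a1, int a2, int a3, int a4);

static double x[8192];   /* local copy of the x vector (size illustrative) */

/* receiver side: unpack (i, x_i) from the payload and store the value */
void recv_x(int i, int w0, int w1, int unused)
{
    int w[2] = { w0, w1 };
    double v;
    memcpy(&v, w, sizeof v);   /* assumes sizeof(double) == 2*sizeof(int) */
    x[i] = v;
}

/* sender side: ship x_i to processor dest as soon as it is computed */
void send_x(int dest, int i, double v)
{
    int w[2];
    memcpy(w, &v, sizeof v);
    CMAML_rpc(dest, (void (*)()) recv_x, i, w[0], w[1], 0);
}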

This implementation uses implicit data structures based on vectors of C programming language structures to store and retrieve data efficiently within the sparse matrix. These data structures provide good cache coherence, because non-zero data values and column location indicators are stored in adjacent physical memory locations. The data structure is composed of six separate parts that implicitly store the block-diagonal-bordered sparse matrix and the last block. Figure 6 graphically illustrates the relationships within the data structure. As illustrated in the figure, the block-diagonal structure, the border-row structure, and the last-block-diagonal structure contain pointers to the sparse row vectors. The second values in the two diagonal pointers are the values of a_{ii}, while the second value in the border-row structure is the destination processor for the (vector × vector) product from this border row used in calculating values in the last diagonal block.
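The layout just described can be pictured with C structures like the following sketch; the field names are illustrative, but the relationships mirror figure 6: rows are contiguous vectors of (value, column) pairs, diagonal structures carry a_ii as their second value, and border rows carry a destination processor instead.

/* Illustrative sketch of the implicit storage scheme of figure 6. */
typedef struct {
    double val;        /* non-zero value a_ij                      */
    int    col;        /* its column location j                    */
} nz_entry;

typedef struct {       /* entry in a diagonal block structure      */
    nz_entry *row;     /* pointer to this row's sparse vector      */
    double    aii;     /* the diagonal value a_ii                  */
} diag_row;

typedef struct {       /* entry in the border-row structure        */
    nz_entry *row;     /* pointer to this border row               */
    int       dest;    /* destination processor for the (vector    */
                       /* x vector) product sent to the last block */
} border_row;

Storing each value next to its column indicator is what gives the cache behavior noted above: a row is traversed as one contiguous run of memory.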

Our parallel Gauss-Seidel algorithm has the following distinct sections, where blocks are defined in section 4:

1. solve for x^{(k+1)} in the diagonal blocks,

2. calculate b̂_{m+1} = b_{m+1} − Σ_{i=1}^{m} (L_{m+1,i} x_i^{(k+1)}) by forming the (matrix × vector) products in parallel,

3. solve for x̂^{(k+1)} in the last diagonal block.

A pseudo-code representation of the parallel Gauss-Seidel solver is presented in figure 7. A version of the software is available that runs on a single processor on the CM-5 to provide empirical speed-up data to quantify multi-processor performance. This sequential software includes the capability to gather convergence-rate data. The parallel implementation has been developed as an instrumented

Figure 6: The Data Structure (the block-diagonal structure, the border rows, and the last diagonal block, each pointing to sparse row vectors)

proof-of-concept to examine the efficiency of each section of the code described above. The host processor is used to gather and tabulate statistics on the multi-processor calculations. Statistics are gathered at synchronization points, so there is no impact on total empirical measures of performance. Empirical performance data is presented in the next section for varied numbers of processors solving real power systems sparse load-flow matrices.

9 Empirical Results

Overall performance of our parallel Gauss-Seidel linear solver is dependent on both the performance of the matrix ordering in the preprocessing phase and the performance of the parallel Gauss-Seidel implementation. Because these two components of the parallel Gauss-Seidel implementation are inextricably related, the best way to assess the potential of this technique is to measure the speedup performance using real power system load-flow matrices. We first present speedup and efficiency data for three separate power systems matrices:

• Boeing-Harwell matrix BCSPWR09: 1,723 nodes and 2,394 edges in the graph [1]

• Boeing-Harwell matrix BCSPWR10: 5,300 nodes and 8,271 edges in the graph [1]

• EPRI matrix EPRI-6K: 4,180 nodes and 5,226 edges in the graph [2]

 EPRI matrix EPRI-6K matrix | 4,180 no des and 5,226 edges in the graph [2]

Matrices BCSPWR09 and BCSPWR10 are from the Bo eing Harwell series and represent electrical

power system networks from the Western and Eastern US resp ectively. The EPRI-6K matrix is

distributed with the Extended Transient-Midterm Stability Program (ETMSP) from EPRI. These

matrices were prepro cessed using a sequential program that ordered the matrix, load balanced each

ordering step, and subsequently pro duced the implicit data structures required for the parallel blo ck-

diagonal-b ordered Gauss-Seidel linear solver. Due to the static nature of the p ower system grid, such

an ordering would b e reused over many hours of calculations in real electrical p ower utility op erations

load- ow applications.

Matrix prepro cessing was p erformed for multiple values of max , the input value to the no de-

DB

tearing algorithm. Empirical p erformance data was collected for each of the aforementioned p ower 14

Node Program

δ ← ∞
while δ > ε_converge
    for k = 1 to n_iter
        /* solve for x^{(k+1)} in the diagonal blocks */
        for all rows i in blocks assigned to this processor
            x̃_i ← x_i
            x_i ← b_i
            for each j ∈ [1, n], j ≠ i, such that a_{ij} ≠ 0
                x_i ← x_i − (a_{ij} × x_j)
            endfor
            x_i ← x_i / a_{ii}
        endfor
        /* calculate the (matrix × vector) products in the lower border */
        for all rows i in the last block assigned to this processor
            x̃_i ← x_i
            x_i ← b_i
        endfor
        for all non-zero rows i in the lower border of this block
            θ ← 0
            for each j such that a_{ij} ≠ 0
                θ ← θ + (a_{ij} × x_j)
            endfor
            at processor μ_i: x_i ← x_i − θ, using active message rpc
        endfor
        /* solve for x̂^{(k+1)} in the last diagonal block */
        for all colors c
            for all rows i in color c assigned to this processor
                for each j ∈ [1, n], j ≠ i, such that a_{ij} ≠ 0
                    x_i ← x_i − (a_{ij} × x_j)
                endfor
                x_i ← x_i / a_{ii}
                broadcast x_i using active message rpc
            endfor
            wait until all values of x_i have arrived
        endfor
    endfor
    /* check convergence */
    δ_μ ← 0
    for all rows i assigned to this processor
        δ_μ ← δ_μ + abs(x̃_i − x_i)
    endfor
    δ ← Σ_{∀μ} δ_μ, using active message rpc
endwhile

Figure 7: Parallel Gauss-Seidel

systems matrices using 1 through 32 processors on the Thinking Machines CM-5 at the Northeast Parallel Architectures Center at Syracuse University. The NPAC CM-5 is configured with all 32 nodes in a single partition, so user software was required to define the number of processors used to actually solve a linear system. Empirical data collected on the parallel Gauss-Seidel algorithm will be presented in two ways. We first present speedup and efficiency data from the three power systems matrices. Relative speedup and efficiency are presented using the times required to perform four iterations and a single convergence check. Next, we provide a detailed performance analysis using actual run times for the individual subsections of the parallel Gauss-Seidel linear solver. This detailed performance analysis illustrates the efficacy of the load balancing step in the preprocessing phase, and illustrates other performance bottlenecks.

Definition (Relative Speedup): Given a single problem with a sequential algorithm running on one processor and a concurrent algorithm running on p independent processors, relative speedup is defined as

S_p ≡ T_1 / T_p,   (13)

where T_1 is the time to run the sequential algorithm as a single process and T_p is the time to run the concurrent algorithm on p processors.

Definition (Relative Efficiency): Relative efficiency is defined as

E_p ≡ S_p / p,   (14)

where S_p is relative speedup and p is the number of processors.

9.1 Performance Analysis

As an introduction to the performance of the parallel Gauss-Seidel algorithm, we present a pair of graphs that plot relative speedup and relative efficiency versus the number of processors. Figure 8 plots the best speedup and efficiency measured for each of the power systems matrices for 2, 4, 8, 16, and 32 processors. These graphs show that performance for the EPRI-6K data set is the best of the three data sets examined. Speedup reaches a maximum of 11.6 for 32 processors and speedups of greater than 10.0 were measured for 16 processors. This yields a relative efficiency of 63% for 16 processors and 36% for 32 processors.

Relative speedups for the BCSPWR09 and BCSPWR10 matrices are less than for the EPRI-6K matrix, but each has speedup in excess of 7.0 for 16 processors. The reason for reduced performance with these matrices for larger numbers of processors is the size of the last block after ordering. For both the BCSPWR09 and BCSPWR10 matrices, the last diagonal block requires approximately 5% of the total calculations while the last block of the EPRI-6K matrix can be ordered so that only 1% of all calculations occur there. As the number of processors increases, communications overhead becomes a significant part of the overall processing time because x^{(k+1)} values in the last diagonal block must be broadcast to other processors before processing can proceed to the next color. Even though the second ordering phase is able to three-color the last diagonal blocks, communications overwhelms the processing time for larger numbers of processors and minimizes speedup in this portion of the calculations. There are insufficient parallel operations when solving for x^{(k+1)} in the diagonal blocks for these matrices to offset the effect of the nearly sequential last block. The effect of Amdahl's law is visible for larger numbers of processors due to the sequential nature of one portion of the algorithm.

Figure 8: Relative Speedup and Efficiency, 2, 4, 8, 16, and 32 processors (relative speedup and relative efficiency versus number of processors for BCSPWR09, BCSPWR10, and EPRI-6K)

Figure 9: Relative Speedup and Efficiency for EPRI-6K Data, 2, 4, 8, 16, and 32 processors (orderings with maximum partition sizes of 128, 192, 256, and 320 nodes)

The BCSPWR09 matrix encounters an additional problem in that the ordering phase was unable to effectively balance the workload in the portion of the software that processes all but the last block. This matrix is the smallest examined, and there is insufficient available parallelism in the matrix to support 16 or more processors.

A detailed examination of relative speedup and relative efficiency is presented in figure 9 for the EPRI-6K data. This figure contains two graphs that each have a family of four curves plotting relative speedup and relative efficiency for each of four maximum matrix partition sizes used in the node-tearing algorithm. The maximum partition sizes used when preprocessing this data are 128, 192, 256, and 320 nodes. The family of speedup curves for the various matrix orderings clearly illustrates the effects of load imbalance for some matrix orderings. For all four matrix orderings, speedup is nearly equal for 2 through 16 processors. However, the values for relative speedup diverge for 32 processors.

Figure 10: Timings for Algorithm Components, EPRI-6K Data, 2, 4, 8, 16, and 32 processors (run times in milliseconds for max_DB of 128, 192, 256, and 320 nodes, in four panels: Diagonal Blocks and Lower Border, Last Block, Update Last Block, and Check Convergence)

We can look further into the cause of the disparity in the relative speedup values in the EPRI-6K data by examining the performance of each of the four distinct sections of the parallel algorithm. Figure 10 contains four graphs that each have a family of four curves that plot the processing time in milliseconds versus the number of processors for each of four values of max_DB from the node-tearing algorithm. The values of max_DB used when preprocessing this data are 128, 192, 256, and 320 nodes. These graphs are log-log scaled, so for perfect speedup, processing times should fall on a straight line with decreasing slope for repeated doubling of the number of processors. One or more curves on each of the performance graphs for the diagonal blocks and lower border, for updating the last diagonal block, and for convergence checks illustrate nearly perfect speedup with as many as 32 processors. Unfortunately the performance for calculating values of x^{(k+1)} in the last block does not also have stellar parallel performance.

not also have stellar parallel p erformance.

The performance graph for the diagonal blocks and lower border clearly illustrates the causes for the load imbalance observed in the relative speedup graph in figure 9. For some matrix orderings, load balancing is not able to divide the work evenly for larger numbers of processors. This always occurs for larger values of max_DB, the maximum size of a block when ordering a matrix. Nevertheless, when ordering a matrix for sixteen or more processors, selecting small values of max_DB will provide good speedup for larger numbers of processors. The performance curves presented in figure 8 show the best performance observed for the four matrix orderings.

Performance of updating the last block by performing sparse (matrix × vector) products and then performing irregular communications yields good performance even for 32 processors. The time to perform updates is correlated to the size of the last diagonal block, which is inversely related to the magnitude of max_DB. The relationship between the magnitude of max_DB and the size of the last block is intuitive, because as the magnitude of max_DB increases, multiple smaller blocks can be incorporated into a single block. Not only can two smaller blocks be consolidated into the single block, but in addition, any elements in the coupling equations that are unique to those network partitions could also be moved into the larger block.

The performance graph for convergence checking illustrates that the load balancing step does not assign equal numbers of rows to all processors. The number of rows on a processor varies as a function of the load balancing. While the family of curves on this graph is more erratic than the curves representing performance in the diagonal blocks and the lower border and the performance of updating the last diagonal block, performance generally is improving with near perfect parallelism even for 32 processors.

Information on the relative performance as a function of max_DB and the number of processors would be required when implementing this parallel algorithm in a load-flow analysis application. To minimize the effects of data movement, the application would require that the entire process to calculate the Jacobian when solving the systems of non-linear equations consider the processor/data assignments from the sparse linear solver. The time to solve each instance of the linear equations generated by the Jacobian is so small that all data redistribution must be eliminated; otherwise, the benefits observed from parallel processing speedup in an application will be lost.

Performance of this parallel Gauss-Seidel linear solver is dependent on the performance of the matrix preprocessing phase. We must reiterate that all available parallelism in this work is a result of ordering the matrix and identifying relationships in the connectivity pattern within the structure of the matrix. Power systems load flow matrices are some of the most sparse irregular matrices encountered. For the EPRI-6K data, the mode in a histogram of the number of edges per node is only two! In other words, the most frequent number of edges at a node is only two. 84.4% of the nodes in the EPRI-6K data have three or fewer edges. For the BCSPWR10 matrix, 71% of the nodes have three or fewer edges. Consequently, power systems matrices pose some of the greatest challenges to produce efficient parallel sparse matrix algorithms.

In figures 11 and 12, we present two orderings of the EPRI-6K data with max_DB equal to 128 and 256 nodes respectively. Non-zero entries in the matrix are represented as dots, and the matrices are delimited by a bounding box. Each of these figures contains three sub-figures: the ordered sparse matrix and two enlargements of the last block, before and after multi-coloring. Both matrices have been partitioned into block-diagonal-bordered form and load-balanced for eight processors. The numbers of nodes in the last diagonal blocks are 153 and 120 respectively, while the numbers of edges in this part of the matrix are 34 and 22 respectively. The graph multi-coloring algorithm is able to color these portions of the matrices with three and two colors respectively. These matrices represent the adjacency structure of the network graphs, and clearly illustrate the sparsity in these matrices.

Figure 11: Ordered EPRI-6K Matrix, max_DB = 128 (last diagonal block shown before and after coloring)

Figure 12: Ordered EPRI-6K Matrix, max_DB = 256 (last diagonal block shown before and after coloring)

Iteration   Total Error Σ_{∀i} abs(x_i^{(k+1)} − x_i^{(k)})   min_{∀i} x_i^{(k+1)}   max_{∀i} x_i^{(k+1)}

1           0.983654764216                                    0.000000119293         0.000478248320
2           0.000143661684                                    0.000000124521         0.000478321438
3           0.000000018556                                    0.000000124522         0.000478321442
4           0.000000000002                                    0.000000124522         0.000478321442
5           0.000000000000                                    0.000000124522         0.000478321442

Table 1: Convergence for EPRI-6K Data, max_DB = 256

9.2 Convergence Rate

Critical to the performance of an iterative linear solver is the convergence of the technique for a given data set. We have applied our solver to sample positive definite matrices that have actual power networks as the basis for the sparsity pattern, and random values for the entries. We have examined convergence for various matrices and various matrix orderings. A sample of the measured convergence data is presented in table 1. This table presents the total error for an iteration, and the minimum and maximum values encountered that iteration. All initial values, x^{(0)}, have been defined to equal 0.0. Convergence is rather rapid, and after four iterations, total error equals 2 × 10^{-12}. Consequently, only a few iterations are required for reasonable convergence with this procedure on this data. We hypothesize that this good convergence rate is in part due to having good estimates of the initial starting vector. For actual solutions of power systems load flows, this solver would be used within an iterative non-linear solver, so good estimates of starting points for each solution also will be readily available.

9.3 Comparing Communications Paradigms

Underlying the whole concept of active messages is the paradigm that the user takes the responsibility for handling messages as they arrive at a destination. The user writes a handler function that takes the data from a register and uses it in a calculation or assigns the data to memory. By assigning message handling responsibilities to the user, communications overhead can be significantly reduced. The effect of reduced overhead can be clearly seen in this algorithm, when performance of an active message-based algorithm is compared to performance of an algorithm with more common blocking send and receive commands. The requirement in this algorithm to broadcast the values of x^{(k+1)} before the next color can proceed causes substantial amounts of communications. In the portion of the algorithm that solves for values of x^{(k+1)} in the last diagonal block, the amount of communications is O(n_procs^2), and as the number of processors increases, the size of the messages for conventional message passing decreases. For traditional message passing paradigms, the cost for communications increases drastically as the number of processors increases, because each message incurs the same latency regardless of the amount of data sent. Meanwhile, with active messages, latency is greatly reduced because the user has the responsibility to process the message. This increase in the number of messages can be seen in figure 10, as the performance for solving for values in the last block eventually increases slightly as the number of processors increases. For an algorithm based on a more traditional send and receive paradigm, performance quickly becomes unacceptable in this portion of the calculations as the number of processors increases.

10 Conclusions

We have developed a parallel sparse Gauss-Seidel solver with the potential for good relative speedup and relative efficiencies for the very sparse, irregular matrices encountered in electrical power system applications. Block-diagonal-bordered matrix structure offers promise for simplified implementation and also offers a simple decomposition of the problem into clearly identifiable subproblems. The node-tearing ordering heuristic has proven to be successful in identifying the hierarchical structure in the power systems matrices, and reducing the number of coupling equations so that the graph multi-coloring algorithm can usually color the last block with only two or three colors. All available parallelism in our Gauss-Seidel algorithm is derived from within the actual interconnection relationships between elements in the matrix, and identified in the sparse matrix orderings. Consequently, available parallelism is not unlimited. Relative speedup tends to increase nicely until either load-balance overhead or communications overhead cause speedup to level off.

We have shown that, depending on the matrix, relative efficiency declines rapidly after 8 or 16 processors, limiting the utility of applying large numbers of processors to a single parallel linear solver. Nevertheless, other dimensions exist in electrical power system applications that can be exploited to use large numbers of processors efficiently. While a moderate number of processors can be efficiently applied to a single power system simulation, multiple events can be simulated simultaneously.

Acknowledgments

We thank Alvin Leung, Kamala Anupindi, Nancy McCracken, Paul Coddington, and Tony Skjellum for their assistance in this research. This work has been supported in part by Niagara Mohawk Power Corporation, the New York State Science and Technology Foundation, the NSF under co-operative agreement No. CCR-9120008, and ARPA under contract #DABT63-91-K-0005.

References

[1] I. S. Duff, R. G. Grimes, and J. G. Lewis. Users' Guide for the Harwell-Boeing Sparse Matrix Collection. Technical Report TR/PA/92/86, Boeing Computer Services, October 1992. (Available by anonymous ftp at orion.cerfacs.fr.)

[2] Electrical Power Research Institute, Palo Alto, California. Extended Transient-Midterm Stability Program: Version 3.0, Volume 4: Programmers Manual, Part 1, April 1993.

[3] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors. Prentice Hall, 1988.

[4] G. Golub and J. M. Ortega. Scientific Computing with an Introduction to Parallel Computing. Academic Press, Boston, MA, 1993.

[5] H. H. Happ. Diakoptics: The Solution of System Problems by Tearing. Proceedings of the IEEE, 62(7):930-940, July 1974.

[6] M. T. Heath, E. Ng, and B. W. Peyton. Parallel Algorithms for Sparse Linear Systems. In Parallel Algorithms for Matrix Computations, pages 83-124. SIAM, Philadelphia, 1991.

[7] G. Huang and W. Ongsakul. Managing the Bottlenecks in Parallel Gauss-Seidel Type Algorithms for Power Flow Analysis. Proceedings of the 18th Power Industry Computer Applications (PICA) Conference, pages 74-81, May 1993.

[8] M. T. Jones and P. E. Plassman. A Parallel Graph Coloring Heuristic. SIAM Journal on Scientific Computing, 14(3):654-669, May 1993.

[9] D. P. Koester, S. Ranka, and G. C. Fox. Parallel Block-Diagonal-Bordered Sparse Linear Solvers for Electrical Power System Applications. In A. Skjellum, editor, Proceedings of the Scalable Parallel Libraries Conference. IEEE Press, 1994.

[10] D. P. Koester, S. Ranka, and G. C. Fox. Parallel Choleski Factorization of Block-Diagonal-Bordered Sparse Matrices. Technical Report SCCS-604, Northeast Parallel Architectures Center (NPAC), Syracuse University, Syracuse, NY 13244-4100, January 1994.

[11] D. W. Matula, G. Marble, and J. D. Isaacson. Graph Coloring Algorithms. Academic Press, New York, 1972.

[12] R. A. Saleh, K. A. Gallivan, M. Chang, I. N. Hajj, D. Smart, and T. N. Trick. Parallel Circuit Simulation on Supercomputers. Proceedings of the IEEE, 77(12):1915-1930, December 1989.

[13] A. Sangiovanni-Vincentelli, L. K. Chen, and L. O. Chua. Node-Tearing Nodal Analysis. Technical Report ERL-M582, Electronics Research Laboratory, College of Engineering, University of California, Berkeley, October 1976.

[14] Thinking Machines Corporation, Cambridge, MA. CMMD Reference Manual, Version 3.0, 1993.

[15] Y. Wallach. Calculations and Programs for Power System Networks. Prentice-Hall, 1986.