DEVELOPMENT AND EVALUATION


OF A MULTIPROCESSING STRUCTURAL

VIBRATION ALGORITHM

A Thesis Presented to

The Faculty of the College of Engineering and Technology

Ohio University

In Partial Fulfillment

of the Requirements for the Degree

Master of Science in Civil Engineering

by Michael Morel

November, 1988

Abstract

A new parallel algorithm utilizing the finite element method for free vibration analysis of large linear elastic models is developed, tested and evaluated herein. The concurrent solution of the generalized eigenproblem, [K][Φ] = [M][Φ][Ω], is based on the classical frontal technique for the solution of linear simultaneous equations and the modified subspace method for the solution of the least dominant eigenpairs. Large structures are subdivided into independent domains containing an equal number of elements, i.e. a balanced load within each domain to avoid overhead; a common boundary (global front) exists between all domains. Using the multitasking library on the Cray X-MP/24 computer, each domain is assigned a separate processor to create the stiffness and mass matrices of the elements and perform the simultaneous assembly/forward elimination and back-substitution completely independently of all other domains. Parallelism is also exploited in the modified subspace method by projecting the stiffness and mass matrices onto the required subspace in each iteration within each domain. In other words, a wave front sweeps across each domain and converges to the global front; upon reaching the boundary, the domains communicate their interface matrices to all other domains and then diverge from the global front. The parallel algorithm successfully uses a completely connected network to transmit the matrices needed by the other domains to continue execution. Following the back-substitution, the specified eigenpairs are solved and tested against the required tolerance level; if the tolerance is met, the program ceases operation. Computational speedup and efficiency are used to determine the effectiveness of parallel processing within domains on a number of rectangular plates. The investigation entails a number of factors affecting the assessment of the algorithm, e.g. the number of domains, the size of the global front, the direction in which the wave front converges and the number of eigenpairs. In addition, a number of typical finite element problems in free vibration are analyzed to show the accuracy and speedup of the parallel numerical algorithm.

Acknowledgements

The author wishes to express his gratitude to Dr. F. Akl, whose guidance, patience and intellect spurred my enthusiasm to reach the goals set herein. Thanks are also due to the Structural Dynamics Branch at NASA Lewis Research Center, which provided the facilities and financial support for this project. A special note of appreciation is extended to Dr. J. Recktenwald for his suggested improvements and for identifying typographical errors, and to Dr. J. Starzyk, who represented the College of Engineering and Technology on my thesis committee.

Contents

Abstract iii

Acknowledgements v

Contents vi

List of Figures ix

List of Tables xii

List of Symbols xiii

1 INTRODUCTION 1

1.1 Scope ...... 1

1.2 Terminology...... 2

1.3 Objective ...... 3

1.4 Outline ...... 3

2 STRATEGY BACKGROUND 5

2.1 Introduction ...... 5

2.2 Frontal Solution ...... 6

2.3 Modified Subspace Method ...... 8

2.4 Multitasking ...... 10

2.4.1 Program Multi ...... 15

2.5 Parallel Implementation of a Multi-Frontal Subspace Method ...... 18

3 A PARALLEL PROCESSING PROGRAM FOR LARGE FINITE ELEMENT EIGEN ANALYSIS 24

3.1 Introduction ...... 24

3.2 Parallel Architecture for Communication ...... 25

3.3 Facilities of Program ...... 26

3.4 Assumptions and Restrictions ...... 28

3.5 Parallel FEDA ...... 29

3.5.1 Nomenclature ...... 29

3.5.2 Outline of Parallel FEDA ...... 31

4 APPLICATION OF PARALLEL FEDA 38

4.1 Purpose ...... 38

4.2 Finite Element Problems and Results ...... 39

4.2.1 Two-Dimensional Beam ...... 39

4.2.2 Space Truss ...... 41

4.2.3 Plane Stress Example ...... 46

4.2.4 Isoparametric Plate ...... 48

5 INVESTIGATION OF PARALLEL FEDA 51

5.1 Preview ...... 51

5.2 Background to Testing ...... 54

5.3 Evaluation of Varying Domains ...... 57

5.4 Examination of Subroutines ...... 65

5.5 Impact of Increasing Elements ...... 68

5.6 Subspace Dimension ...... 70

5.7 Size of Global Front ...... 73

6 CONCLUSIONS AND RECOMMENDATIONS 76

REFERENCES 80

Appendix A: Data Input for Parallel FEDA 85

Appendix B: Output of Parallel FEDA 91

List of Figures

2.1 Finite Element Problem Subdivided into N Domains ......

2.2 Parallel Processing within Domains ......

3.1 Completely Connected Parallel Architecture ......

3.2 Parallel Algorithm for the i-th Domain ......

4.1 Beam Idealization ......

4.2 Idealization of Space Truss ......

4.3 Helicopter Tail-Boom Idealization ......

4.4 Helicopter Tail-Boom Structure: (a) Geometry of Tail-Boom; (b) Finite El-

ement Model for the Tail-Boom Structure (Arora and Nguyen, 1980) ....

4.5 Idealization of T-Section ......

4.6 Plate with Hole Idealization ......

5.1 Different Size Plates Used in Analysis with all Edges Clamped: (a) 8 Element

Plate (67 dof); (b) 16 Element Plate (115 dof); (c) 24 Element Plate (163

dof); (d) 32 Element Plate (211 dof); (e) 40 Element Plate (259 dof) ....

5.2 A 64 Element Plate Clamped (c) on all Edges ......

5.3 Time Chart of Four Logical Processor System Running on a Two Physical Processor Machine ......

5.4 Time Chart Showing the Interpretation of a Four Logical Processor System Running on a Two Physical Processor Machine ...... 56

5.5 Two Domain Configuration of 64 Element Plate with Element Numbering

Scheme ...... 59

5.6 Two Domain Configuration with Horizontal Global Front for 64 Element

Plate ...... 59

5.7 Four Domain Idealization of 64 Element Plate with Element Numbering

Scheme and Vertical Local Fronts ...... 60

5.8 Four Domain Idealization with a Cross (+) Global Front for the 64 Element

Plate ...... 60

5.9 Horizontal Global Front with Four Domains on a 64 Element Plate . . . . . 61

5.10 Six Domain Configuration of the 64 Element Plate, Domains One and Six

have 12 Elements Each all Other Domains Contain Ten Elements ...... 61

5.11 Six Domain Idealization of a 64 Element Plate, Domains One and Six have

12 Elements Each with all Others Containing Ten Elements ...... 62

5.12 Eight Domain Configuration of the 64 Element Plate with Vertical Local Fronts 62

5.13 Eight Domain Idealization of 64 Element Plate ...... 64

5.14 Subdivided Plate for all Test Runs with the Direction of the Local Fronts

Shown ...... 69

5.15 Effect of Increasing the Number of Elements on the Execution Time with all

Other Factors Constant ...... 70

5.16 CPU Time Affected by Extra Elements with all Major Factors Constant; a Limit of Two Subspace Iterations was Used ...... 71

5.17 Impact on Execution Time with an Increasing Subspace Dimension with all Other Factors Constant ...... 72

5.18 Influence of Increased Eigenpairs on the CPU Time Keeping all Factors Con-

stant ...... 72

5.19 Impact of Increasing the Global Front on the Different Four Domain Models

(Figs. 5.7-5.9) ...... 75

6.1 Effectiveness of Parallel FEDA for a 64 Element Plate with q = 6 and Six

Subspace Iterations ...... 77

6.2 Evaluation of 64 Element Plate on Parallel FEDA for q = 6 and Six Subspace

Iterations ...... 78

B.1 A Two Domain 64 Element Plate with 13 dof on the Global Front ..... 91

List of Tables

2.1 Roots of the 11th Order Lobatto Rule ......

4.1 Clamped-Clamped Beam Eigenvalues ......

4.2 Space Truss Eigenvalues ......

4.3 Eigenvalues for Helicopter Tail-Boom ......

4.4 Eigenvalues for Plane Stress T-Section ......

4.5 Eigenvalues for a Plate with a Hole ......

5.1 Analysis of Various Domains for a 64 Element Plate Using a Subspace (q) of 2 ......

5.2 Analysis of Various Domains for a 64 Element Plate Using a Subspace (q) of 6 ......

5.3 Evaluation of 64 Element Plate Limited to Two Subspace Iterations ....

5.4 Subroutine Speedups for Varying Domain Sizes with q = 6 (Refer to Table

5.3 for Figure Numbers and Section 3.5.2 for Subroutine Descriptions) ...

5.5 Percentage of Execution Time taken by Parallel FEDA for the Two Domain

Model in Fig . 5.5 ......

5.6 Impact of Degrees of Freedom on Global Front for q = 6 and Two Subspace

Iterations ......

List of Symbols

area
right-hand side = [M][V] (n×q)
right-hand side belonging to the i-th domain
local front right-hand sides
assembled global front right-hand sides
element of matrix [B]
domain displacement matrix
Young's modulus of elasticity
load matrix
moment of inertia
unit matrix
stiffness matrix of size n×n
element stiffness matrix
stiffness matrix belonging to the i-th domain
local front stiffness matrix
assembled global front stiffness matrix
subspace stiffness matrix of size q×q
element of stiffness matrix
length
mass matrix of size n×n
element mass matrix
subspace mass matrix of size q×q
number of elements per domain
total number of degrees of freedom of system
number of domains, processors or tasks
number of eigenpairs
size of subspace (≤ n)
eigenvector of the auxiliary eigenproblem
l-th root of Lobatto rule
thickness
eigenvectors at the l-th iteration
element eigenvector at the l-th iteration
over-relaxation factor
eigenvalues of required subspace
eigenvectors of required subspace
i-th eigenvalue
i-th eigenvector
mass density
Poisson's ratio
largest eigenvalue of subspace

Chapter 1

INTRODUCTION

1.1 Scope:

With structures in free-vibration analysis becoming larger and increasingly complex, the engineer is called upon to solve these problems faster and more accurately by using the

sequential digital computer to their advantage. Nevertheless, the engineer is limited by

time, size and perplexity of the problems they can analyze. No longer can we rely on

accelerating computer speed to answer these problems because higher computer performance is quickly reaching its ultimate peak (Quinn, 1987). Engineers must look for new, quicker

and more efficient procedures to solve large finite element problems above and beyond the basic sequential coding of numerical methods used in practice today.

Parallel processing, i.e. computing information concurrently for the solution of a sin-

gle problem, has evolved to meet these demands. The hardware is available to develop

parallel numerical methods for analysis of complex structural finite element vibration prob-

lems. Therefore efficient algorithms must be advanced to meet these computing needs while

maintaining the dependability of existing sequential algorithms. A parallel algorithm using the unique characteristics of the frontal solution and modified subspace method to achieve a faster execution time in the solution of eigenpairs for large structural finite element problems is developed. These redefined and refined methods in a parallel setting will hopefully break down this barrier even further, enabling engineers to use this computer program in a productive fashion.

1.2 Terminology:

Definitions are given for the following terms to clarify their meaning:

• Domain - a section of a subdivided finite element model considered as an indepen-

dent structure except for a common boundary that connects the substructure to the

complete finite element model.

• Speedup - the ratio between the time needed for a sequential algorithm to solve

the problem divided by the time taken to execute the same problem on a parallel

algorithm.

• Task - the code and data of the program, whose instructions must be processed in

sequential order. A separate task (the computational model) will be assigned to each

domain (the physical model); therefore, the number of tasks is equivalent to the

number of domains in the finite element model.

1.3 Objective:

There are three objectives of this research. The first is to develop a parallel processing algorithm for the extraction of eigenpairs that will decrease the total execution time of a multiprocessing program compared to a similar sequential program called "FEDA" (Akl, 1978

and Akl, et al., 1979). The potential goal will be to achieve a speedup equivalent to the theoretical speedup of the parallel algorithm.

Secondly, test the performance of the new algorithm on a number of typical finite

element eigenvalue problems for accuracy. The results will be compared to other results

obtained from NASTRAN (Schaeffer, 1979) and closed-form solutions (Paz, 1985, Bostic

and Fulton, 1986, and Arora and Nguyen, 1980).

Finally, investigate the speedup of the new parallel algorithm in a number of different

situations based upon the substructuring of the finite element model. As a result, guidelines

will be presented to help the user divide their finite element model in the most efficient

manner to achieve an improved speedup.

1.4 Outline:

In the following chapters a parallel finite element algorithm for large structural systems in

free-vibration is developed, tested and investigated. The ensuing chapter explains the basic

method and characteristics of the frontal technique, modified subspace method and multi-

tasking used to exploit the parallel processing capabilities of the Cray X-MP/24 computer. Fundamental equations that were used in the implementation stage will be formulated and identified.

CHAPTER 3 gives a full description of the algorithm and incorporates the strategy from the previous chapter into a successful computer program. Capabilities, assumptions and limitations will be given to clarify the program. An outline of the program is included with an explanation of each subroutine that is called inside the program.

A number of typical finite element problems will be analyzed in CHAPTER 4 and compared to other results obtained. The speedups of the computation are presented to show the impact of the parallel solution over a sequential algorithm.

In the next chapter, an investigation is carried out on a 64 element plate clamped on all edges, with the assessment of the algorithm based on a number of controlling variables, e.g. the number of domains, the substructuring of the finite element model and the dimension of the subspace. Guidelines will be formulated to help the user efficiently divide their model into a number of domains that will be concurrently processed.

CHAPTER 6 will draw final conclusions from the experimentation and give suggestions for further research.

Chapter 2

STRATEGY BACKGROUND

2.1 Introduction:

The analysis of a structure in free motion provides the most important dynamic properties of the structure which are the natural frequencies and the corresponding modal shapes.

There are two important places where these properties can be used: either when the mode superposition method will be used to determine the displacement response of a structure subjected to dynamic loads, or when it is required to estimate the natural frequencies of a system in free vibration analysis. The design of aircraft structures, spacecraft structures, ship structures, machine components and framed structures are some examples where dynamic analysis is desired. Many of these structures must be represented by large finite element models requiring substantial amounts of computational time to solve. Efficient numerical methods are needed to solve large eigenproblems quickly and accurately, but taking the problem of large finite element models one step farther by using parallel processing can achieve an even greater speedup in time while keeping the same accuracy. The three most time consuming procedures in the solution of large eigenproblems are the creation of element stiffness and mass matrices, the solution of linear simultaneous equations and the extraction of eigenpairs. The efficiency and power of the following procedures, the frontal technique, modified subspace method and multitasking, have prompted the author to incorporate them in the solution of large eigenvalue problems. In this chapter, the frontal technique (Irons, 1970), modified subspace method (Akl, et al., 1979) and multitasking (Cray Research, Inc., 1987) are presented and their characteristics are discussed. Though the three procedures were previously published, it is deemed necessary to repeat some of their details in order to grasp a full understanding of how each procedure relates to the others. In addition, an explanation of how these methods work together in a multi-processing environment will be given.

2.2 Frontal Solution:

The algorithm developed uses the frontal solution to solve the set of linear simultaneous equations:

[K][D] = [F]     (2.1)

The main idea of the frontal technique is to assemble the equations and eliminate the variables at the same time. As soon as all contributions are made and assembled from all relevant nodes, each degree of freedom for this node is eliminated. As a result, the total stiffness matrix is never formed because after elimination of the degree of freedom the reduced equation is stored in an array of inactive variables. The frontal technique uses a direct solution based on Gauss elimination with back-substitution to solve for the unknown displacements. The wave front, i.e. the active nodes on the front, divides the structure into two substructures with three sections of elements. The first section includes the elements that have already been processed, the second section is the active elements on the front and the third is the elements that have yet to be processed. The front begins at one end of the structure and advances, engulfing one element at a time, eliminating the nodes on the element that are fully assembled until it has swept over the whole structure. After all nodes are eliminated the front then reverses its previous moves over the entire structure to complete the process and perform the back-substitution.

The frontal solution possesses certain advantages over other direct techniques and has proven to be a very effective and powerful means for solving the positive definite symmetric equations arising in standard finite element analysis. The band matrix methods are the chief competitor of the frontal solution. A comparison between the two shows that for small problems the frontal and band routines are about the same, because of the extra coding required by the frontal routine. However, with larger analyses, the frontal routine is superior in terms of speed and core requirements (Irons, 1970 and Melosh and Bramford, 1969). In order to optimize the solution the band matrix methods require optimum numbering of nodes, while element numbering is not important. On the other hand, the frontal solution requires optimum numbering of the elements, while node numbering is immaterial with respect to solution optimization. Optimum numbering of elements will reduce the largest frontwidth, defined as the maximum degrees of freedom on the wave front at any point in time as the wave front sweeps across the structure.

2.3 Modified Subspace Method:

The general eigenvalue problem for a structural system is defined as:

[K][Φ] = [M][Φ][Ω]     (2.2)

in which [K] is the stiffness matrix and [M] is the mass matrix, each being of order n × n, where n is the number of degrees of freedom of the finite element model. Furthermore, [Φ] is a modal matrix where each column is an eigenvector, and [Ω] is a diagonal matrix containing eigenvalues as its diagonal elements. The matrices [K] and [M] are symmetric; [K] is positive definite and [M] may be only positive semi-definite. Therefore, the eigenvalues of equation (2.2) are real and positive, and the eigenvectors are orthogonal with respect to [K] and [M] and satisfy the orthogonality conditions

[Φ]^T [K] [Φ] = [Ω],     [Φ]^T [M] [Φ] = [I]

where [I] is a unit matrix.

For solution of large-order eigenproblems, the modified subspace iteration method (Akl,

1983) is used. This method is a combination of both inverse iteration and the Rayleigh

Ritz procedures and has an accelerated rate of convergence of about one-third faster (Akl,

1978), on the average, when compared to the classical subspace method approach (Bathe and Wilson, 1973).

Solving for all eigenvalues and eigenvectors can be very time consuming and costly for large finite element problems. However, a structure can be adequately analyzed using only a few modes, without significant loss in accuracy by neglecting higher modes. The modified subspace method has proven to be an accurate and efficient technique in evaluating eigenpairs in large structural systems, where probably no more than 40 modes are required

in systems having many hundred to a few thousand degrees of freedom (Akl, 1978).

Assuming one wishes to calculate the p lowest eigenpairs of equation (2.2), the iteration is carried out on a subspace of dimension q ≥ p. One should start with q = min{2p, p + 8}, for this improves the accuracy of the p eigenvectors and the corresponding eigenvalues (Bathe, 1982); for example, p = 4 gives q = min{8, 12} = 8. In conclusion, equation (2.2) represents the finite element model to be analyzed, where the lowest q eigenpairs are of interest and thus represent a subspace of dimension q. The basic algorithm of the modified subspace method is as follows (Akl,

1983):

1. Let [V]_1 be q starting eigenvectors.

2. Operate on each [V]_l as follows:

[V̄]_{l+1} = [K]^{-1} [M] [V]_l = [K]^{-1} [B]_l

where l = 1, 2, 3, ...

3. Modify [V̄]_{l+1} to increase the convergence rate:

for l = 1, ρ_l = 0, and for l > 11, ρ_l = 0;

for l = 2, ..., 11, ρ_l = the interval points of the 11th-order Lobatto rule spanning 0 to 1/ω_q², taken in progressive order (Table 2.1),

where ω_q² is the largest eigenvalue in the subspace.

where wi is the largest eigenvalue in the subspace. 4. Project [K] and [MI onto the current subspace:

[K]*_{l+1} = [V̄]^T_{l+1} [K] [V̄]_{l+1}

[M]*_{l+1} = [V̄]^T_{l+1} [M] [V̄]_{l+1}

5. Solve the auxiliary eigenproblem:

[K]*_{l+1} [Q]_{l+1} = [M]*_{l+1} [Q]_{l+1} [Ω]_{l+1}     (2.8)

6. A new set of improved eigenvectors is:

[V]_{l+1} = [V̄]_{l+1} [Q]_{l+1}

7. Test for convergence on ω_q². Repeat steps 2 to 6 until convergence is achieved.
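Collecting steps 2 through 6, one complete iteration of the method can be summarized compactly as follows; the bars denote the iterates of step 2, and the Lobatto modification of step 3 and the convergence test of step 7 are omitted for brevity:

\[
\bar V_{l+1} = K^{-1} M V_l,\qquad
K^*_{l+1} = \bar V_{l+1}^{\,T} K \bar V_{l+1},\qquad
M^*_{l+1} = \bar V_{l+1}^{\,T} M \bar V_{l+1},
\]
\[
K^*_{l+1}\, Q_{l+1} = M^*_{l+1}\, Q_{l+1}\, \Omega_{l+1},\qquad
V_{l+1} = \bar V_{l+1}\, Q_{l+1}.
\]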

Table 2.1: Roots of the 11th Order Lobatto Rule

2.4 Multitasking:

Multitasking, defined as the structuring of a program into two or more tasks which can execute concurrently, is Cray Research Inc.'s officially released and supported implementation of parallel processing. There are two methods of multitasking available: the first is macrotasking, best suited for programs with larger, long-running tasks. The second method (microtasking) is beneficial for programs with shorter running tasks. All references, examples and developments of multitasking herein shall refer to the method of macrotasking.

The programmer must explicitly code his/her FORTRAN subroutines so they can run in parallel. Multitasking subroutines can be used to decrease the execution time of a complete program, but a parallel job not efficiently multitasked could take more time than a job that is sequential, due to the extra overhead of multitasking. To activate the multitasking subroutines on the Cray X-MP/24 computer, the user must use the Cray control verb MULTI in the job control language (JCL) deck.

The Cray X-MP/24 computer is a tightly coupled multiple instruction multiple data

(MIMD) machine which can execute different instructions and operate on different data, i.e. possesses N independent processors each having its own control unit. Memory on the Cray

X-MP multiprocessor system can be accessed independently or in parallel during execution.

The system has low overhead of task initiation for multitasking and has proven to be very efficient (Hwang and Briggs, 1984).

Consider a finite element problem broken down into N domains as shown in Fig. 2.1.

Each domain will be assigned a task, defined as a unit of computation that can be scheduled and whose instructions must be processed in sequential order, meaning each task is a separate subprogram. The tasks are initiated by a TSKSTART subroutine located in the multitasking library of the Cray computer. When calling the TSKSTART subroutine, a taskarray (task control array used for this task) and a name (entry point at which task execution begins) must be passed in the parameter list. An optional list of arguments can also be passed. In our case, the specific domain will be passed along with the taskarray and name.

Figure 2.1: Finite Element Problem Subdivided into N Domains

Therefore, the following call statement and do-loop will be needed to produce a multitasking environment with each domain assigned a separate central processing unit (CPU).

      DO 10 I = 1,N
         CALL TSKSTART (PROCESS(1,I), BEGIN, DOMAIN(I))
   10 CONTINUE

- Where N is the number of domains in the finite element model.

Not all domains will be completed at the same time, so a TSKWAIT subroutine will be called to wait for the indicated task to complete execution. For each TSKSTART called, a TSKWAIT must also be called so that all subprograms end at the same time. The taskarray is the only parameter needed in the call list. The TSKWAIT subroutine is called through a do-loop, similar to the way TSKSTART was called earlier.

      DO 20 I = 1,N
         CALL TSKWAIT (PROCESS(1,I))
   20 CONTINUE

The TSKSTART and TSKWAIT calls create the parallelism needed to solve the finite element problem using multitasking. Each domain will have its own task to be performed by a separate CPU. This will enable the parallel finite element algorithm to execute faster than a sequential finite element program.

Communication between tasks/domains will be needed to solve the finite element problem. As the tasks are executing, certain variable values must be transmitted between tasks/domains. To guarantee that the values are computed in one task before they are used in another, the correspondence must take place at a synchronization point, defined as a point in time at which a task has received the go-ahead to proceed with its processing.

Therefore, one task computes the value before the synchronization point, and the other tasks reference the value only after the synchronization point.

The facility that allows signaling between tasks is called an event, which has two states: cleared and posted. When an event is posted it has reached the synchronization point and the variable can be used in other tasks. If the event is cleared, no waiting is needed because the variable has already been posted and cleared for all tasks to continue. The event is identified by an integer variable passed through the subroutine EVASGN. An event variable cannot be used unless this subroutine is called before any other event subroutines.

Therefore, EVASGN passes an integer variable used as an event and an optional value if needed. The following example will invoke the subroutine:

      DO 30 I = 1,N
         CALL EVASGN(EVENT(I))
   30 CONTINUE

- Where N is the number of domains in the finite element problem.

Three other subroutines will be called along with the EVASGN subroutine: EVWAIT, EVPOST and EVCLEAR. Each of these subroutines is needed to complete the process of communication between tasks when a variable is needed by more than one task working in parallel. The three subroutines also must pass the same integer variable as the

EVASGN subroutine to link all subroutines to the same event.

The EVWAIT subroutine waits until the specified event is posted, but the task resumes

execution without waiting if the event is already posted. Subroutine EVPOST returns

control to the calling task after the subroutine posts the event. Once the event is posted, all other tasks waiting on that event may resume execution. In addition, EVCLEAR clears an event and returns control to the calling task, but if the variable is already cleared then execution continues.
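As a smaller illustration before the full example, the following sketch (an assumed minimal layout, not part of the thesis program, using only the multitasking calls introduced above) shows one task posting an event so that a second task may safely read a shared value:

      PROGRAM EVDEMO
C+++ ILLUSTRATIVE SKETCH ONLY: A MAIN TASK PRODUCES A SHARED VALUE
C+++ AND POSTS AN EVENT; A WORKER TASK WAITS ON THE EVENT, CLEARS
C+++ IT, AND THEN READS THE VALUE.
      EXTERNAL WORKER
      INTEGER PROCESS(3,1)
      INTEGER EVENT1
      REAL VALUE
      COMMON /SHARE/ EVENT1, VALUE
C+++ IDENTIFY THE INTEGER VARIABLE USED AS AN EVENT.
      CALL EVASGN(EVENT1)
      PROCESS(1,1) = 3
C+++ START THE WORKER TASK.
      CALL TSKSTART(PROCESS(1,1), WORKER)
C+++ PRODUCE THE SHARED VALUE, THEN POST THE EVENT SO THE WORKER
C+++ MAY READ IT.
      VALUE = 2.0
      CALL EVPOST(EVENT1)
C+++ WAIT FOR THE WORKER TO FINISH.
      CALL TSKWAIT(PROCESS(1,1))
      STOP
      END

      SUBROUTINE WORKER
      INTEGER EVENT1
      REAL VALUE
      COMMON /SHARE/ EVENT1, VALUE
C+++ WAIT UNTIL THE VALUE HAS BEEN POSTED, THEN CLEAR THE EVENT
C+++ FOR POSSIBLE REUSE.
      CALL EVWAIT(EVENT1)
      CALL EVCLEAR(EVENT1)
      PRINT *, 'VALUE RECEIVED = ', VALUE
      RETURN
      END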

The example program MULTI will help show the uses of these multitasking subroutines.

It is assumed that the program will run on a MIMD computer possessing 2000 physical

processors. For clarification, a horizontal row of dots in the example program takes the place of executable FORTRAN statements that are unimportant in the presentation of the multitasking technique.

2.4.1 Program Multi:

C
C+++ THIS PROGRAM SHOWS HOW MULTITASKING CAN BE ACHIEVED IN
C+++ GENERATING A STIFFNESS MATRIX FOR 2000 ELEMENTS CONCURRENTLY.
C+++ FOR EACH ELEMENT A TASK IS ASSIGNED; A GLOBAL TASK WILL
C+++ RECEIVE ALL ELEMENT STIFFNESS MATRICES AND ASSEMBLE THEM
C+++ INTO A GLOBAL MATRIX.  THE CALCULATION OF THE ELEMENT
C+++ STIFFNESS MATRICES SHOULD BE 2000 TIMES AS FAST AS A
C+++ SEQUENTIAL ALGORITHM PERFORMING THE SAME COMPUTATIONS.
C+++ DOMAIN = NUMBER OF THE TASK/ELEMENT.
C
C+++ MAIN PROGRAM
C
      EXTERNAL BEGIN
      INTEGER EVENT1(2000), PROCESS(3,2000), DOMAIN(2000)
      COMMON /EVENTS/ EVENT1
C
C+++ DATA DECLARATION
C
      DO 5 I = 1,2000
         PROCESS(1,I) = 3
         DOMAIN(I) = I
    5 CONTINUE
C
C+++ EVENT ASSIGNMENTS
C
      DO 10 I = 1,2000
         CALL EVASGN(EVENT1(I))
   10 CONTINUE
C
C+++ START DOMAIN TASKS
C
      DO 20 I = 1,2000
         CALL TSKSTART(PROCESS(1,I), BEGIN, DOMAIN(I))
   20 CONTINUE
C
C+++ START GLOBAL TASK
C
      CALL ASSEMBLE
C
C+++ TASK COMPLETION
C
      DO 30 I = 1,2000
         CALL TSKWAIT(PROCESS(1,I))
   30 CONTINUE
C
C+++ CLEAR ALL EVENTS
C
      DO 40 I = 1,2000
         CALL EVCLEAR(EVENT1(I))
   40 CONTINUE
C
      STOP
      END

      SUBROUTINE ASSEMBLE
      INTEGER EVENT1(2000)
      COMMON /EVENTS/ EVENT1
C
C+++ WAIT AND CLEAR ALL EVENTS.
C
      DO 10 I = 1,2000
         CALL EVWAIT(EVENT1(I))
         CALL EVCLEAR(EVENT1(I))
   10 CONTINUE
C
C+++ READ ALL OF THE ELEMENT STIFFNESS MATRICES FROM THE TAPE
C+++ THEY WERE WRITTEN TO IN SUBROUTINE BEGIN.
C
      ......
C
C+++ ASSEMBLE THE ELEMENTS INTO A GLOBAL STIFFNESS MATRIX.
C
      ......
C
      RETURN
      END

      SUBROUTINE BEGIN (DOMAIN)
      INTEGER EVENT1(2000)
      COMMON /EVENTS/ EVENT1
      INTEGER DOMAIN
C
C+++ READ THE NECESSARY DATA.
C
      ......
C
C+++ COMPUTE THE STIFFNESS MATRIX FOR THE ELEMENT.
C
      ......
C
C+++ WRITE THE ELEMENT STIFFNESS MATRIX TO A TAPE, THEREBY
C+++ ALLOWING THE SUBROUTINE ASSEMBLE TO ACCESS THE INFORMATION.
C
      ......
C
C+++ POST THE EVENTS AS THEY ARE FINISHED.
C
      CALL EVPOST(EVENT1(DOMAIN))
C
      RETURN
      END

2.5 Parallel Implementation of a Multi-Frontal Subspace Method:

Before presenting the strategy of the frontal solution, modified subspace method and multitasking in a multi-processing environment, the scenario of the problem will be created and some basic assumptions are given. Referring to Fig. 2.2, a linear elastic finite element model can be divided into N domains, each of which is assumed to possess m elements. This creates a global front along lines AB, BC, BD and BE and local fronts along ABC, EBD and CBD. Only nodes shall be on the global front, and all elements must be in a domain even though they may lie on the global front.

The eigenpairs will be calculated using a MIMD parallel computer that will have N concurrent processors, with one processor assigned to each domain/task. All processors will begin operating within each domain and will converge to the local front. All nodes located on each domain's local front comprise the global front of the entire structure. When reaching the local front, each domain processor will wait for all the other tasks to reach this critical synchronization point; the nodes lying on the local fronts will be communicated to all the other domain processors when all tasks reach this point. Upon receiving all local fronts, each domain processor will assemble and eliminate the nodes on the global front and will immediately proceed to diverge from the local front and complete the solution process.

Figure 2.2: Parallel Processing within Domains

Simultaneously, the stiffness and mass matrices, [I

Specifically, a wave front advances in each separate domain until it reaches its local front (Fig. 2.2); in the i-th domain the local front is ABC. As a result, the stiffness matrix is constructed and eliminated within each domain, entirely independently of all other domains, until the local front (ABC) is reached. The i-th domain coefficients at any point in time during the process can be expressed as:
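In the standard Gauss-elimination form on which the frontal technique is based (a sketch in index notation, with s denoting the fully summed equation being eliminated and j ranging over the remaining active columns and the right-hand-side columns), the update of the active coefficients is:

\[
k_{ij} \;\leftarrow\; k_{ij} - \frac{k_{is}\,k_{sj}}{k_{ss}},
\qquad
b_{ij} \;\leftarrow\; b_{ij} - \frac{k_{is}\,b_{sj}}{k_{ss}}
\]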

where k_ij and b_ij are the sums of the stiffness and right-hand-side contributions, and k_ss is the pivot used in Gauss elimination for determining the variable v_s of the fully summed equation.

From the assembly and elimination process within the i-th domain and at iteration l, a set of dense submatrices is obtained:

[K]^i [V̄]^i_{l+1} = [B]^i_l     (2.12)

where [K]^i is the partially processed stiffness matrix, [V̄]^i_{l+1} the approximate eigenvectors and [B]^i_l the corresponding right-hand sides. A closer look at equation (2.12) shows that the matrices can be decoupled in accordance with the substructuring of the finite element model into the local front (f) and domain (d) submatrices.

The decoupling carried out yields the subsequent equations:
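A partitioned form consistent with the definitions that follow (a reconstruction in which the domain rows have already been reduced while the local front rows remain unprocessed) is:

\[
\begin{bmatrix} [U]_d & [K]_{df} \\ 0 & [K]_f \end{bmatrix}
\begin{Bmatrix} [V]_d \\ [V]_f \end{Bmatrix}
=
\begin{Bmatrix} [B]_d \\ [B]_f \end{Bmatrix}
\]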

where [U]_d is a processed upper triangular matrix, [K]_f contains the unprocessed degrees of freedom (dof) on the local front, [K]_df are the dof within domain i but associated with the local front, [V]_d are the unknown variables within the domain, and [V]_f are the unknown variables along the local front; again, all values are generated within domain i and are in the l-th iteration.

[B]_d and [B]_f are the domain and local front right-hand sides, respectively.

With the assembly and elimination process completed in each domain, all local front matrices, the stiffness [K]_f and right-hand side [B]_f, will be communicated to all other domains. Whereupon, on receiving each of the local front matrices, the global front matrices of the entire structure are assembled:

[K]_ff = Σ_{i=1}^{N} [K]_f     and     [B]_ff = Σ_{i=1}^{N} [B]_f

The domain processors then proceed to use the Gauss-Jordan elimination method to solve for [V]_ff in back-substitution:

[K]_ff [V]_ff = [B]_ff     (2.17)

Since [V]_ff is a superset of all [V]_f, back-substitution can immediately begin and no waiting is needed in domain i to continue the solution process.

Referring back to the basic algorithm of the modified subspace method, it can be seen that most computations are performed on an element-by-element basis. In the second step, the solution of [K]^{-1}[B]_l uses the frontal solution to perform the simultaneous elimination of variables, concurrently within each domain. The q starting vectors of [V]_1, the modification of [V̄]_{l+1} to increase the convergence rate and the calculation of the updated vectors in [V]_{l+1}

(until the required accuracy is achieved) are also done concurrently for all elements in the domain (steps 1, 3 and 6). In addition, [K]*_i and [M]*_i for the current subspace can be calculated in the separate domain processors, as shown in the ensuing equations.
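Consistent with the flow chart of Fig. 3.2, these per-domain projections take the form sketched below, summing over the elements e contained in domain i and using the notation of section 2.3:

\[
[K]^*_i = \sum_{e \in i} \left([\bar V]^{\,e}_{l+1}\right)^T [K]^e \,[\bar V]^{\,e}_{l+1},
\qquad
[M]^*_i = \sum_{e \in i} \left([\bar V]^{\,e}_{l+1}\right)^T [M]^e \,[\bar V]^{\,e}_{l+1}.
\]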

The results of these matrices are sent to all domains to be assembled with every other domain's [K]*_i and [M]*_i:

[K]* = Σ_{i=1}^{N} [K]*_i     and     [M]* = Σ_{i=1}^{N} [M]*_i

Following assembly, the auxiliary eigenproblem, equation (2.8), can be solved and tested for convergence on ω_q²; if convergence is not met, the process is repeated.

Successfully implementing the frontal solution and modified subspace method in a parallel environment, by dividing a finite element model into N domains, depends on establishing timely communication between the domain processors. Ideally, the goal is to keep the processors busy at all times by dividing the amount of work equally among the domain processors (for example, a 64 element plate split into four domains of 16 elements each). In addition, the numbering of elements and the choice of the most suitable global front are essential in achieving speedup and efficiency for the parallel algorithm. The use of multitasking subroutines will create the multi-computer setting and the transmission between the domain processors.

The characteristics of the frontal solution, modified subspace method and multitasking have been presented. The advantages of using these two numerical methods in a multiprocessing environment, by utilizing the multitasking subroutines on the Cray X-MP/24 supercomputer, have been shown. Furthermore, both methods have proven to be effective algorithms for the solution of large finite element problems in parallel and sequential processing (Melhem, 1985, Storaasli, et al., 1988, Melosh and Bramford, 1969 and Bathe, 1982).

In the following chapter, a more comprehensive description of the algorithm developed and of its multitasking implementation will be given.

Chapter 3

A PARALLEL PROCESSING PROGRAM FOR LARGE FINITE ELEMENT EIGEN ANALYSIS

3.1 Introduction:

The eigenvalue analysis problem associated with structural vibration is a major, computationally intensive activity in large-scale finite element calculations; therefore, a computer program is developed based upon the unique architecture of parallel processor computers to decrease the total execution time of the analysis. The FORTRAN program developed utilizes the MIMD Cray X-MP/24's parallel processing capabilities through the multitasking libraries. In the following sections, a full description of the algorithm is given; also, a skeleton outline is provided to help grasp the technique of multitasking used in achieving successful implementation. The multi-frontal algorithm was developed from the sequential program "FEDA" (Akl, 1978 and Akl, et al., 1979). Restructuring and modifications were made in the input, pre-front, model matrices and solution-resolution routines, as well as the insertion of multitasking routines to exploit the computer's parallel processing powers.

3.2 Parallel Architecture for Communication:

Timely correspondence between domains/tasks for the finite element model is vital in producing a successful performance. Information must be passed between each CPU at synchronization points within the code. The algorithm favorably uses a completely connected system, Fig. 3.1, i.e. every CPU in the system will be able to communicate with every other CPU, to pass all information with minimum overhead. This architecture takes advantage of the Cray X-MP/24's central memory, which can be accessed independently or in parallel. Each domain task will have its own local memory to use at any time during the execution. In addition, shared memory is available for task transmission between domains by using common blocks to obtain the information, but only after the information has been computed by the other tasks.

There are three significant synchronization points embedded in the program to give all tasks the go-ahead to continue after they (tasks) have received the following information:

• At the first synchronization point all input data is sent to every task except task one

which read in the data.

• [K]_f and [B]_f are obtained at the second synchronization point.

• [K]*_i and [M]*_i are collected at the third synchronization point.

Figure 3.1: Completely Connected Parallel Architecture

A flow chart of the parallel algorithm is shown in Fig. 3.2.

3.3 Facilities of Program:

The parallel FEDA program has the following facilities available:

1. Eigen analysis of linear elastic finite element systems to obtain the q lowest eigenpairs

(λ_i, {φ}_i) in an n-dimensional system described by the following equation:

[K]{φ}_i = λ_i [M]{φ}_i

where q ≤ n.

Figure 3.2: Parallel Algorithm for the i-th Domain (flow chart: form element data and matrices [K]^e, [M]^e, [V]^e_l; assemble and eliminate up to the local front; wait for and receive [K]_f and [B]_f from all other domains; form [K]_ff and [B]_ff and solve [K]_ff[V]_ff = [B]_ff; back-substitute; project [K]*_i and [M]*_i; wait for and receive all domain projections; solve [K]*[Q] = [M]*[Q][Ω] and update [V]_{l+1} = [V̄]_{l+1}[Q]; test for convergence and repeat if necessary)

2. The user has the option of choosing the required tolerance level that will be placed

upon the q eigenpairs.

3. Static analysis of the finite element system:

4. Items one and three can be solved concurrently on N processors, where N ≤ 8, and will

have a lower execution time compared to the sequential FEDA program.

5. The following elements are available:

(a) Two-dimensional beam element (Paz, 1985).

(b) Two or three-dimensional truss element (Paz, 1985).

(c) An eight node isoparametric plane stress element (Weaver and Johnson, 1987).

(d) An isoparametric thin plate element (Baldwin, et al., 1973).

Furthermore, the program was designed so that additional elements could be installed

very easily.

3.4 Assumptions and Restrictions:

Some basic assumptions and restrictions apply to the parallel FEDA program:

1. Structures or solids are assumed to behave in a linearly elastic manner.

2. Rigid body modes are eliminated. 3. The matrices [K] and [MI are symmetric; [K] is positive definite and [MI may be only

positive semi-definite.

4. The program is only functional on a Cray supercomputer with multitasking capabili-

ties.

5. The number of domains/tasks shall never be greater than eight.

6. If an element lies on the global front, it must be defined in one of the domains.

3.5 Parallel FEDA:

The description, program format and multitasking technique will be presented. This information is shown in a structured FORTRAN outline, showing all call statements to the program and multitasking library subroutines. A description is given for each subroutine above the call command in the comment statements. The outline is not complete; it is only a tool to help the reader digest the logic of the program in a multi-processing environment. A row of horizontal dots takes the place of executable FORTRAN statements that are insignificant in the presentation of Parallel FEDA.

3.5.1 Nomenclature:

The following variable names used in the outline of parallel FEDA are defined for the reader's benefit:

• DOMAIN(NFRONT) - the integer value assigned to each task called in the main program with respect to the substructuring of the finite element model.

• DOMAIN - integer value unique to each domain/task corresponding to the finite element model.

• DOMFRONT - subroutine name called in the main program for each domain assigned in the finite element model.

• EVENT(NFRONT,NFRONT) - an integer variable representing an event to allow signaling between tasks.

• NCASE - maximum number of iterations in subspace eigen analysis.

• NFRONT - total number of domains or tasks the finite element problem has been divided into.

• NRESOL - the current iteration number in eigen analysis.

• NSTOP - an integer variable assigned 0 at the beginning of the iteration process; if the tolerance is achieved the variable will be assigned 10.

• PROCESS(3,10) - task control array used for the task assignments. The first dimension must be set and the second dimension is a unique task identifier.

3.5.2 Outline of Parallel FEDA:

MAIN PROGRAM

C
C     A PARALLEL FRONTAL SOLUTION OF FINITE ELEMENT SYSTEMS
C     USING A MODIFIED SUBSPACE APPROACH TO EXTRACT THE
C     EIGENPAIRS
C

      INTEGER PROCESS(3,10), DOMAIN(10)
      INTEGER EVENT(10,10)
      EXTERNAL DOMFRONT
C
C+++ PUT EVENT IN COMMON BLOCK SO THAT ALL DOMAINS CAN ACCESS IT.
C
      COMMON /EVENTS/ EVENT
C
C+++ DATA DECLARATION
C
      DO 10 I = 1,NFRONT
         PROCESS(1,I) = 3
         DOMAIN(I) = I
   10 CONTINUE
C
C+++ EVENT ASSIGNMENTS; IDENTIFIES THE INTEGER VARIABLES THAT
C+++ THE PROGRAM INTENDS TO USE AS AN EVENT.
C
      DO 20 I = 1,NFRONT
      DO 20 J = 1,NFRONT
         CALL EVASGN(EVENT(I,J))
   20 CONTINUE
C
C+++ START ALL DOMAIN TASKS EXCEPT DOMAIN 1.
C
      DO 30 I = 2,NFRONT
         CALL TSKSTART(PROCESS(1,I), DOMFRONT, DOMAIN(I))
   30 CONTINUE
C
C+++ START DOMAIN 1, WHICH IS CALLED SEPARATELY TO LIMIT THE
C+++ INPUT DATA TO JUST ONE DOMAIN.
C
      CALL DOMFRONT(DOMAIN(1))
C
C+++ UPON TASK COMPLETION, TELL EACH DOMAIN THAT DOMAIN 1 HAS
C+++ BEEN COMPLETED.
C
      DO 40 NF = 1,NFRONT
         CALL EVPOST(EVENT(1,NF))
   40 CONTINUE
C
C+++ WAIT FOR TASKS TO BE COMPLETED THAT WERE CALLED BY TSKSTART.
C
      DO 50 NF = 2,NFRONT
         CALL TSKWAIT(PROCESS(1,NF))
   50 CONTINUE
C
      STOP
      END

      SUBROUTINE DOMFRONT(DOMAIN)
      INTEGER DOMAIN
C
      IF ( DOMAIN .EQ. 1 ) THEN
C
C+++ READ INITIAL DATA.
C
C+++ DIAGNOSE THE INITIAL DATA; IF FATAL OR NONFATAL ERRORS OCCUR,
C+++ CALL DOCTOR, AND SET UP HOUSEKEEPING FOR DYNAMIC DIMENSIONING
C+++ OF VECTOR ARRAYS.  THE SUBROUTINE DOCTOR WILL IDENTIFY AND
C+++ LIST THE ERROR MESSAGES FOUND IN THE DATA.
C
         CALL DNURSE
C
C+++ READ THE REMAINING DATA, e.g. ELEMENT TYPES, NODAL
C+++ COORDINATES etc., FOR THE PROBLEM AND ASSIGN THEM TO
C+++ VARIABLES IN A COMMON BLOCK THAT WILL BE ACCESSED BY ALL
C+++ DOMAINS.
C
         CALL FINPUT
C
C+++ TELL ALL DOMAINS, EXCEPT DOMAIN 1, THAT ALL DATA HAS
C+++ BEEN READ IN.
C
         DO 10 NF = 1,NFRONT
            CALL EVPOST(EVENT(1,NF))
   10    CONTINUE
C
      ELSE
C
C+++ WAIT FOR THE DATA TO BE READ INTO DOMAIN 1.
C
         CALL EVWAIT(EVENT(1,DOMAIN))
C
C+++ SET UP HOUSEKEEPING FOR DYNAMIC DIMENSIONING OF VECTOR
C+++ ARRAYS.
C
         CALL DNURSE
C
C+++ ASSIGN ALL DATA RECEIVED IN THE COMMON BLOCK FROM
C+++ DOMAIN 1 TO THE CORRECT VARIABLES.
C
         CALL DINPUT
C
C+++ CLEAR EVENTS SO THAT THEY MAY BE USED AGAIN.
C
         CALL EVCLEAR(EVENT(1,DOMAIN))
C
      ENDIF
C
C+++ CHECK FOR THE LAST APPEARANCE OF EACH NODE; WHEN IT IS
C+++ FOUND, MAKE THE NODE NEGATIVE (CREATING THE PRE-FRONT).
C+++ ALSO, THE SIZE OF THE GLOBAL FRONT IS DETERMINED AND A
C+++ FEW MORE DIMENSIONS ARE CALCULATED THAT ARE NEEDED IN
C+++ SUBROUTINE DFRONT.
C
      CALL DMATRON
C
C+++ CREATES THE ELEMENT STIFFNESS AND MASS MATRIX FILES
C+++ FOR THE SPECIFIC ELEMENT TYPE; IN ADDITION, CREATES
C+++ THE ELEMENT LOAD AND EIGENVECTOR FILES.
C
      CALL ESTIFF
C
      DO 6 NRESOL = 1,NCASE
C
C+++ SOLVES THE SET OF LINEAR SIMULTANEOUS EQUATIONS USING
C+++ THE FRONTAL TECHNIQUE.
C
      CALL DFRONT
C

      SUBROUTINE DFRONT
C
C+++ ASSEMBLE THE STIFFNESS MATRIX AND ELIMINATE THE DOF IN
C+++ THEIR LAST APPEARANCE UP TO THE LOCAL FRONT.
C
      ......
C
C+++ AFTER THE LOCAL FRONT IS REACHED, BEGIN WORK ON THE
C+++ GLOBAL FRONT.
C
      ......
C
C+++ TELL ALL OF THE OTHER DOMAINS THAT YOU HAVE COMPLETED THE
C+++ ASSEMBLY OF THE LOCAL FRONT LOCATED WITHIN THIS DOMAIN.
C
      DO 100 NF = 1,NFRONT
         IF ( NF .NE. DOMAIN ) THEN
            CALL EVPOST(EVENT(DOMAIN,NF))
         ENDIF
  100 CONTINUE
C
C+++ WAIT FOR ALL DOMAINS TO REACH THEIR LOCAL FRONTS.
C
      DO 200 NF = 1,NFRONT
         IF ( NF .NE. DOMAIN ) THEN
            CALL EVWAIT(EVENT(NF,DOMAIN))
         ENDIF
  200 CONTINUE
C
C+++ CLEAR THE EVENT FOR FUTURE USE.
C
      DO 300 NF = 1,NFRONT
         IF ( NF .NE. DOMAIN ) THEN
            CALL EVCLEAR(EVENT(NF,DOMAIN))
         ENDIF
  300 CONTINUE
C
C+++ ASSEMBLE THE GLOBAL STIFFNESS MATRIX AND PERFORM A GAUSS-
C+++ JORDAN ELIMINATION TO SOLVE THE UNKNOWN VARIABLES
C+++ LOCATED ON THE GLOBAL FRONT.
C
      ......
C
C+++ BEGIN THE BACK-SUBSTITUTION TO SOLVE FOR ALL UNKNOWN
C+++ VARIABLES IN THE DOMAIN.
C
      ......

RETURN

C
C+++ SOLVE FOR THE EIGENPAIRS.
C
      CALL DCONDNS
C

      SUBROUTINE DCONDNS
C
C+++ PROJECT THE STIFFNESS AND MASS MATRICES ONTO THE CURRENT
C+++ SUBSPACE FOR EACH ITERATION.
C
      ......
C
C+++ TELL ALL DOMAINS THAT THIS TASK HAS BEEN COMPLETED.
C
      DO 100 NF = 1,NFRONT
         IF ( NF .NE. DOMAIN ) THEN
            CALL EVPOST(EVENT(DOMAIN,NF))
         ENDIF
  100 CONTINUE
C
C+++ WAIT FOR DOMAINS TO REACH THIS POINT.
C
      DO 200 NF = 1,NFRONT
         IF ( NF .NE. DOMAIN ) THEN
            CALL EVWAIT(EVENT(NF,DOMAIN))
         ENDIF
  200 CONTINUE
C
C+++ CLEAR THE EVENT.
C
      DO 300 NF = 1,NFRONT
         IF ( NF .NE. DOMAIN ) THEN
            CALL EVCLEAR(EVENT(NF,DOMAIN))
         ENDIF
  300 CONTINUE
C
C+++ ASSEMBLE [K]*(I) AND [M]*(I) FROM ALL DOMAINS.
C
      ......
C
C+++ SOLVE THE AUXILIARY EIGEN PROBLEM FOR EACH ITERATION BY A
C+++ PSEUDO-JACOBI METHOD (IRONS, 1968).
C
      CALL EIGN
C
C+++ TEST THE EIGENVALUES FOR CONVERGENCE.  IF TOLERANCE IS
C+++ MET, THEN SET NSTOP = 10.  A BETTER M-ORTHONORMALIZED
C+++ APPROXIMATION OF THE REQUIRED EIGENVECTORS IS CONSTRUCTED.
C
      ......

RETURN

C
      IF ( NSTOP .NE. 0 ) GO TO 70
C
C+++ AFTER EACH ITERATION, CALCULATE A NEW SET OF EIGENVECTORS
C+++ THAT WILL BE USED IN THE NEXT ITERATION.
C
      CALL RELOAD
C
    6 CONTINUE
C
C+++ PRINTS THE OUTPUT FOR MODE SHAPES AT PRESCRIBED NODES
C+++ RATHER THAN THE CUSTOMARY ELEMENT-BY-ELEMENT OUTPUT.
C
   70 CALL PLOTING
C
C+++ WAIT FOR TASKS TO BE COMPLETED AND CLEAR EVENTS.
C
      IF ( DOMAIN .NE. 1 ) THEN
         CALL EVWAIT(EVENT(1,DOMAIN))
         CALL EVCLEAR(EVENT(1,DOMAIN))
      ENDIF
C
      RETURN
      END

Chapter 4

APPLICATION OF PARALLEL FEDA

4.1 Purpose:

This chapter has a three fold purpose:

1. Validate the algorithm's accuracy by comparing the results to the following: a se-

quential frontal/subspace algorithm called "FEDA" (Akl, et al., 1979), NASTRAN

(NAsa STRuctural ANalysis program) and in some cases analytical results. Note: All

NASTRAN eigenpairs were solved by the Tridiagonal (Givens) method (Schaeffer,

1979).

2. To show the capabilities and wide range of problems the algorithm can handle.

3. Attract attention to the speedup attained for each finite element problem by subdi-

viding the structure into two domains. As a consequence of only two physical CPUs

on the Cray X-MP/24 computer, two domains were chosen to get the exact speedup

between the sequential and parallel algorithm. The ideal or theoretical speedup for

the parallel solution would be 2.0.

Therefore, in the following pages a description, diagram, input properties, speedup and a table of eigenvalues solved by the different procedures for each structure analyzed will be presented. For all problems solved a tolerance of was imposed on each eigenvalue in the subspace. Located in Appendix A is a description of the input data used for Parallel

FEDA.

4.2 Finite Element Problems and Results:

4.2.1 Two-Dimensional Beam:

The first problem solved was a beam clamped at both ends (Figure 4.1) and having 20 elements with 57 degrees of freedom (dof), i.e. three dof at each node; this rather simple problem was chosen to assist in the initial developmental efforts. The global front for this beam consisted of one node located at the center of the beam. As a result, there are two domains/tasks: the right half is domain one and the left half is domain two. In NASTRAN, the beam was modeled using the CBAR element (Schaeffer, 1979) along with a consistent mass matrix. Speedup for the beam is 1.49; due to the simplicity and small number of elements the speedup is low. The total numbers of iterations for the parallel and sequential programs were 14 and 15, respectively. Table 4.1 contains the four lowest eigenvalues solved by parallel and sequential FEDA, NASTRAN and a closed-form solution.

Beam with Both Ends Fixed

SPEEDUP = 1.49

Number of Nodes: 21

Number of Elements: 20

Input Properties:

I

Figure 4.1: Beam Idealization

Table 4.1: Clamped-Clamped Beam Eigenvalues

ORDER OF      EIGENVALUES PREDICTED BY
EIGENVALUE    PARALLEL FEDA**   SEQUENTIAL FEDA**   NASTRAN   CLOSED FORM*
1             19.55             19.55               19.55     19.55
2             148.6             148.6               148.6     148.6
3             571.1             571.1               570.9     571.0
4             1560.8            1561.0              1559.6    1560.3

** q = 8     * See Paz (1985)
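Using the efficiency measure introduced later in section 5.1 (equation 5.2), the corresponding two-processor efficiency for this beam example is:

\[
E = \frac{S_P}{N} \times 100\% = \frac{1.49}{2} \times 100\% \approx 74.5\% .
\]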

4.2.2 Space Truss:

Two structures were analyzed using the three dimensional truss element, with all members having three degrees of freedom at each node. Both space trusses were constructed on

NASTRAN and use the CBAR element (Schaeffer, 1979) with all rotations fixed and a lumped mass matrix. The first space truss (Bostic and Fulton, 1986) used in solving for the eigenpairs has 88 members and 26 joints with the four base nodes fixed, shown in Figure 4.2, as a result the truss contains 66 dof. There are four nodes on the global front, located atop the third tier from the bottom, and an equal number of elements in their respective domains.

Property set one belongs to all members that make up the five box subtrusses (horizontal and vertical elements are 20 ft in length), while the eight members that protrude from the sides pertain to property set two and have a horizontal length of 40 ft. For the space truss, a speedup of 1.77 was achieved with both parallel and sequential algorithms performing

Space Truss

SPEEDUP = 1.77

Number of Nodes: 26

Number of Elements: 88

Input Properties:

Property Set 1 Property Set 2

E = 3.9385E10 psi     E = 3.9385E10 psi

Figure 4.2: Idealization of Space Truss

Table 4.2: Space Truss Eigenvalues

ORDER OF      EIGENVALUES PREDICTED BY
EIGENVALUE    PARALLEL FEDA**   SEQUENTIAL FEDA**   NASTRAN    LANCZOS METHOD*
1             2.104E04          2.104E04            2.104E04   2.104E04
2             3.075E04          3.075E04            3.076E04   3.076E04
3             3.624E04          3.624E04            3.524E04   3.625E04
4             1.819E05          1.819E05            1.819E05   1.819E05
5             3.869E05          3.869E05            3.869E05   3.869E05
6             9.168E05          9.168E05            9.168E05   9.170E05

** q = 12     * See Bostic and Fulton (1986)

eight iterations until convergence was met. In addition, a comparison of the six lowest eigenvalues is located in Table 4.2.

The second space truss considered is an open helicopter tail-boom structure (Arora and Nguyen, 1980). There are 108 truss members and 28 nodes, as shown in Figure 4.3; the four left end nodes are fixed, with the structure possessing 72 dof. Keeping both domains balanced, an equivalent number of elements is assigned to each separate task. The structure is subdivided at its midpoint and consists of four nodes on the global front. An abnormally high speedup of 2.10 was calculated for the tail-boom structure because of the smaller number of iterations taken by the parallel program (15) compared to the sequential program

(20). Figure 4.4 shows the geometry and lengths of the finite element model; furthermore, Table 4.3 contains the eigenvalues of the tail-boom.

Helicopter Tail-Boom

Table 4.3 contains the eigenvalues of the tail-boom. Helicopter Tail-Boom

SPEEDUP = 2.10

Number of Nodes: 28

Number of Elements: 108

Input Properties:

E = 1.05E08 psi

A = 1.0 in2

I = 1.OE02 in4

p = 2.588E-03 lb sec2/in4

Figure 4.3: Helicopter Tail-Boom Idealization

Figure 4.4: Helicopter Tail-Boom Structure: (a) Geometry of Tail-Boom; (b) Finite Element Model for the Tail-Boom Structure (Arora and Nguyen, 1980) Table 4.3: Eigenvalues for Helicopter Tail-Boom

ORDER OF      EIGENVALUES PREDICTED BY
EIGENVALUE    PARALLEL FEDA**   SEQUENTIAL FEDA**   NASTRAN    SUBSPACE ITERATION*
1             1.848E04          1.848E04            1.846E04   1.880E04
2             2.076E04          2.076E04            2.077E04   2.113E04
3             3.091E05          3.091E05            3.090E05   4.075E05
4             3.723E05          3.723E05            3.719E05   4.370E05
5             4.160E05          4.160E05            4.163E05   4.553E05
6             1.574E06          1.574E06            1.574E06   1.593E06

** q = 12 * See Arora and Nguyen (1980)

4.2.3 Plane Stress Example:

Referring to Figure 4.5, a plane stress model comprising eight rectangular isoparametric elements in the form of an inverted T is analyzed. The T-section was modeled on

NASTRAN using a QUAD8 element with the rotations fixed and a coupled mass matrix.

Shown in Table 4.4 are the eigenvalues determined by parallel and sequential FEDA and

NASTRAN using a QUAD8 element with the rotations fixed and a coupled mass matrix. Shown in Table 4.4 are the eigenvalues determined by parallel and sequential FEDA and NASTRAN. For the structure, each square element has eight nodes with two dof each and a length of 2.0 in. Moving across the bottom horizontally, all nodes are clamped, which leaves 60 dof free to displace. Substructures are formed by a vertical global front in the center of the structure, which includes seven joints. Due to the lower number of iterations for the parallel program (16) compared with the sequential program (21), the speedup of 2.31 once again rose above the theoretical speedup.

Plane Stress T-Section

SPEEDUP = 2.31

Number of Nodes: 39

Number of Elements: 8

Input Properties:

E = 1.0 psi

v = 0.4

t = 1.0 in

p = 1.0 lb sec2/in4

Figure 4.5: Idealization of T-Section

Table 4.4: Eigenvalues for Plane Stress T-Section

ORDER OF      EIGENVALUES PREDICTED BY
EIGENVALUE    PARALLEL FEDA**   SEQUENTIAL FEDA**   NASTRAN
1             3.806E-03         3.806E-03           3.822E-03
2             2.054E-02         2.054E-02           2.055E-02
3             2.344E-02         2.344E-02           2.341E-02
4             7.223E-02         7.223E-02           7.151E-02
5             9.195E-02         9.195E-02           9.165E-02
6             1.233E-01         1.233E-01           1.229E-01
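Part of the 2.31 speedup for the T-section stems from the parallel program converging in fewer iterations (16 versus 21). Under the rough assumption, not verified here, that each subspace iteration costs about the same in both programs, the per-iteration speedup would be approximately

\[
S_{P,\mathrm{iter}} \approx 2.31 \times \frac{16}{21} \approx 1.76 ,
\]

with the remaining factor of 21/16 ≈ 1.31 attributable to the reduced iteration count.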

4.2.4 Isoparametric Plate:

The final problem solved is a cantilevered plate which has a hole within the structure shown in Figure 4.6. Free vibration analysis of the 116 element plate on NASTRAN modeled by the QUAD4 element and an uncoupled mass matrix was performed. The left end rotations and translations are fixed, moving in the vertical direction from top to bottom for all nodes.

The nodal configuration for the square isoparametric thin plate used in the analysis comprises four corner deflections and 12 slope variables (two at each corner node and one at the midside nodes); in addition, there are 701 dof for the total structure and 12 nodes lying on the global front located at the center of the plate along a horizontal line. The length of each plate element is 2.0 in. The speedup of the parallel algorithm is 1.84, with the number of iterations (21) being equivalent in the parallel and sequential algorithms. Due to the slight differences in the stiffness matrices determined by parallel FEDA and NASTRAN, the eigenvalues in Table 4.5 are not equivalent but differ by 1-3% for each mode.

Cantilevered Plate with Hole

SPEEDUP = 1.84

Number of Nodes: 430

Number of Elements: 116

Input Properties:

E = 7.7E10 psi

v = 0.33

t = 1.0 in

p = 2.830E03 lb sec2/in4

Figure 4.6: Plate with Hole Idealization

Table 4.5: Eigenvalues for a Plate with a Hole

ORDER OF      EIGENVALUES PREDICTED BY
EIGENVALUE    PARALLEL FEDA**   SEQUENTIAL FEDA**   NASTRAN
1             504.8             504.8               511.9
2             3798.0            3798.0              3815.1
3             17730.0           17730.0             17583.0
4             34510.0           34510.0             34843.0
5             1.326E06          1.326E06            1.292E06

Chapter 5

INVESTIGATION OF PARALLEL FEDA

5.1 Preview:

To measure the success of the parallel algorithm on the Cray X-MP/24 supercomputer, two important factors will be determined: speedup (see section 1.2) and efficiency which measures the utilization of the parallel machine, e.g. if the processors are idle or require extra calculations introduced through parallelization of the problem the speedup and efficiency decrease (Schendel, 1984), the subsequent equations represent the speedup and efficiency:

SPEEDUP = Sp = Ts / Tp                              (5.1)

EFFICIENCY = (Sp / N) x 100%  <=  100%              (5.2)

where:

Ts is the time of the sequential algorithm.
Tp is the time of the parallel algorithm.
N is the number of processors used in the parallel solution.

These evaluation tools were computed for plates of 8, 16, 24, 32, 40 and 64 elements, shown in Figs. 5.1 and 5.2; a short numerical illustration of Eqs. (5.1) and (5.2) follows the list below. Variables affecting the assessment of the algorithm are:

1. The number of domains chosen.

2. Total number of elements and degrees of freedom (dof).

3. Formulation of the global front relative to the number of degrees of freedom on the

local fronts.

4. Direction the domain fronts move controlled by the element numbering scheme.

5. Number of iterations taken to achieve the required tolerance level.

6. Total number of eigenpairs (q) predicted.
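As a quick numerical check of Eqs. (5.1) and (5.2), the sketch below evaluates speedup and efficiency for a pair of hypothetical wall-clock times; the timing values and the program name are illustrative assumptions and are not measured FEDA data. It is written in free-form Fortran for readability.

    ! Illustrative only: evaluates Eqs. (5.1) and (5.2) for assumed timings.
    program speedup_demo
      implicit none
      real    :: ts, tp, sp, ep
      integer :: n

      ts = 12.0          ! assumed sequential wall-clock time (seconds)
      tp = 6.5           ! assumed parallel wall-clock time (seconds)
      n  = 2             ! number of processors (domains)

      sp = ts / tp                 ! Eq. (5.1): speedup
      ep = (sp / real(n)) * 100.0  ! Eq. (5.2): efficiency in percent

      print '(a,f5.2,a,f6.1,a)', 'speedup = ', sp, '   efficiency = ', ep, ' %'
    end program speedup_demo

For the assumed times this prints a speedup of about 1.85 and an efficiency of about 92%, which is the kind of value reported for the two-domain models in the tables that follow.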

As a result, a number of test runs were analyzed to examine these variables influencing the speedup and efficiency of the algorithm in a dedicated mode. This chapter presents the results obtained from a number of example problems for rectangular plate structures with all edges clamped (c). All structural plates use an isoparametric square plate element (Baldwin, et al., 1973) of length 2.0 in; the model consists of four corner nodes and four mid-side nodes amounting to 16 degrees of freedom per element. The input properties for all plates are: Young's modulus of 1.0 psi, Poisson's ratio of 0.3, mass density of 1.0 lb sec2/in4 and a thickness of 1.0 in. One of two tolerance levels was placed upon all q eigenvalues.

Figure 5.1: Different Size Plates Used in Analysis with all Edges Clamped: (a) 8 Element Plate (67 dof); (b) 16 Element Plate (115 dof); (c) 24 Element Plate (163 dof); (d) 32 Element Plate (211 dof); (e) 40 Element Plate (259 dof)

Figure 5.2: A 64 Element Plate Clamped (c) on all Edges

5.2 Background to Testing:

Time functions were inserted into the algorithm at strategic points to define accurately the time needed to perform the calculations. The time elapsed from one station to another is wall-clock time and not the CPU time charged to the job (Cray Research, Inc., 1987). A distinction must be made between them because a multiple processor job has a greater CPU time than an equivalent job on a single processor. The total CPU time for a multi-processor job will be labeled work done by the system and will not be equal to the wall-clock (execution) time; in contrast, wall-clock and CPU time for a sequential job are nearly equivalent. The total execution time is taken as the sum of the subroutine times for the slowest task in the parallel solution.
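The distinction can be stated compactly in code. The sketch below uses hypothetical per-task subroutine timings (not measured FEDA data): the reported execution time is the largest per-task sum, while the work done is the total CPU time over all tasks.

    program timing_demo
      implicit none
      ! Hypothetical wall-clock seconds spent in three subroutines by each of two tasks.
      real :: t(2,3)
      real :: task_total(2), exec_time, work_done
      integer :: i

      t(1,:) = [1.2, 3.4, 0.6]   ! task 1 subroutine times (assumed)
      t(2,:) = [1.1, 3.7, 0.5]   ! task 2 subroutine times (assumed)

      do i = 1, 2
         task_total(i) = sum(t(i,:))
      end do

      exec_time = maxval(task_total)   ! execution (wall-clock) time: slowest task governs
      work_done = sum(task_total)      ! work done: total CPU time of all tasks

      print *, 'execution time =', exec_time, '  work done =', work_done
    end program timing_demo

For a sequential run the two quantities coincide; for a multitasked run the work done exceeds the execution time, which is why the two measures are tracked separately in this chapter.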

As reported earlier, the Cray X-MP/24 computer has only two physical CPUs but the program can handle up to eight logical processors, i.e. when the number of processors used in the parallel solution exceeds the number of processors on the machine (physical processors), the processors are called logical processors. This must be kept in mind when assessing the speedup for domains/tasks greater than two on account of the extra waiting introduced at synchronization points; this time must be eliminated to get a more precise execution time.

Shown in Fig. 5.3 is a time chart of a four logical CPU system working on a two physical CPU machine with all tasks assigned the same amount of work. Tasks 3 and 4 cannot begin execution until tasks 1 and 2 reach the first synchronization point (AT1). The two task processors working simultaneously will not finish at exactly the same time, but within 10-30 milliseconds of each other; therefore it will be assumed that the task processors are nearly equivalent in time. In a two processor system it was found that minimal time, approximately 0.02 milliseconds, was taken to post, wait and clear an event at the place of communication if the event had been previously posted by the other task. At AT2, tasks 3 and 4 reach the same synchronization point that tasks 1 and 2 did earlier at AT1, and all four tasks are ready to continue execution again. Following the logic of parallel FEDA's main program, task one will always continue ahead of the other tasks after all tasks reach the same synchronization point, along with the last task to post the same event; this can be seen at point AT2 if task 4 posted its event after tasks 2 and 3. Therefore, when calculating the total wall-clock time for domains greater than two, only the actual computing time will be summed and not the time accumulated by the idle domains waiting for an open processor to continue execution. The total execution time for the four tasks in Fig. 5.3 will be calculated as if there are N physical processors located on the machine, as shown in Fig. 5.4. The following conclusions can be drawn from Figs. 5.3 and 5.4:

1. Fig. 5.3 represents a two physical CPU system with four logical CPUs being used. Fig. 5.4 shows the interpretation of how the total execution time is resolved.

2. Time begins at AT0 and DT0 and ends at AT4 and DT2.

Figure 5.3: Time Chart of Four Logical Processor System Running on a Two Physical Processor Machine

Figure 5.4: Time Chart Showing the Interpretation of a Four Logical Processor System Running on a Two Physical Processor Machine

3. AT1, AT2 and DT1 are associated with the same synchronization point.

4. Referring to Figs. 5.3 and 5.4:

   T11 + T12 ≅ D11 + D12
   T21 + T22 ≅ D21 + D22
   T31 + T32 ≅ D31 + D32
   T41 + T42 ≅ D41 + D42

Input and output (I/O) for all sample runs were kept to a minimum. The SSD solid-state storage device used to read and write information on tapes/disks was bypassed because when one processor gains access to the device the other processor becomes idle. When the problem of single-threaded I/O is resolved, parallel FEDA will incorporate the SSD device to significantly enhance its performance.

5.3 Evaluation of Varying Domains:

The most important feature of this research is the speedup obtained by the parallel solution.

Putting all other factors aside, the bottom line is to analyze the substructured finite element model faster, and just as accurately, on a concurrent machine as compared to a sequential machine, while keeping the overhead to a minimum. It is expected that the speedup will increase as the number of processors increases, with a theoretical limit of N, where N is the number of domains into which the finite element model has been subdivided. As a reminder, if N > 2 the N domains will be executed on N logical processors.

Table 5.1: Analysis of Various Domains for a 64 Element Plate Using a Subspace (q) of 2

Number of    Figure    SPEEDUP             EFFICIENCY          SPEEDUP              EFFICIENCY
Processors   Number    (first tolerance)   (first tolerance)   (second tolerance)   (second tolerance)
1            5.2       1.00                100%  [13]          1.00                 100%  [16]
2            5.5       1.85                93%   [13]          1.86                 93%   [16]
4            5.7       3.86                96%   [9]           3.13                 78%   [14]
6            5.11      3.02                50%   [13]          3.18                 53%   [16]
8            5.12      4.15                52%   [9]           3.61                 45%   [14]

The value in [ ] is the total number of iterations to achieve the prescribed tolerance.

A 64 element rectangular plate (Fig. 5.2) containing 403 dof is tested to determine the speedup and efficiency on two, four, six and eight processors, shown in Table 5.1 for q = 2 and Table 5.2 for q = 6. The decoupled plates are shown in Figs. 5.5-5.13 to show the global fronts and element numbering layout. Favorable results were obtained on the two and four processor models, with the six and eight processor models developing trouble due to the high number of dof on the global front. The number of iterations taken to achieve tolerance plays a big part in determining the overall speedup of the system, e.g. a greater number of iterations in the parallel solution compared to the sequential solution will significantly lower the speedup. A thorough evaluation will be made of the various numbers of domains to show the advantages and deficiencies of parallel FEDA compared to sequential FEDA by looking at specific subroutines and communication links.

Figure 5.5: Two Domain Configuration of 64 Element Plate with Element Numbering Scheme

Figure 5.6: Two Domain Configuration with Horizontal Global Front for 64 Element Plate

Figure 5.7: Four Domain Idealization of 64 Element Plate with Element Numbering Scheme and Vertical Local Fronts

Figure 5.8: Four Domain Idealization with a Cross (+) Global Front for the 64 Element Plate

Figure 5.9: Horizontal Global Front with Four Domains on a 64 Element Plate

Figure 5.10: Six Domain Configuration of the 64 Element Plate, Domains One and Six have 12 Elements Each, all Other Domains Contain Ten Elements

Figure 5.11: Six Domain Idealization of a 64 Element Plate, Domains One and Six have 12 Elements Each with all Others Containing Ten Elements

Figure 5.12: Eight Domain Configuration of the 64 Element Plate with Vertical Local Fronts

Table 5.2: Analysis of Various Domains for a 64 Element Plate Using a Subspace (q) of 6

Number of    Figure    SPEEDUP             EFFICIENCY          SPEEDUP              EFFICIENCY
Processors   Number    (first tolerance)   (first tolerance)   (second tolerance)   (second tolerance)
1            5.2       1.00                100%  [9]           1.00                 100%  [15]
2            5.5       1.83                92%   [9]           2.08                 104%  [13]
4            5.8       2.67                67%   [9]           3.10                 77%   [13]
6            5.11      4.04                67%   [7]           5.30                 88%   [9]
8            5.12      2.27                28%   [7]           3.19                 40%   [10]

The two processor model (Fig. 5.5) performed consistently well, with an average speedup of 1.84, and even rose above the theoretical limit for q = 6 at one of the tolerance levels because of a lower number of iterations (Tables 5.1 and 5.2). When comparing the results of Tables 5.1, 5.2 and 5.3, the two domain model gains momentum as the number of iterations increases where the total number of sequential iterations is equivalent to the total number of parallel iterations. The global front has only 13 dof, which is the lowest possible number for a two domain model. This is the only system where an evaluation of the communication links could be verified. It was found that the overhead associated with transmitting information from one processor to another through common blocks was minimal.

Scanning Tables 5.1 and 5.2 shows that the results from the four domain models in Figs. 5.7 and 5.8 benefited greatly from a lower number of iterations in the parallel solution.

Figure 5.13: Eight Domain Idealization of 64 Element Plate

In Table 5.2 the domain model in Fig. 5.8 was used instead of the model in Fig. 5.7 because of the abnormally high number of iterations, 21 and 27, taken by the model in Fig. 5.7 to converge. The reason for the lower speedup in Table 5.2 is the greater number of degrees of freedom (71) on the global front for Fig. 5.8 compared to 39 dof in Fig. 5.7. This concern will be addressed later in the chapter. Therefore, for a more comparable speedup, refer to Table 5.3, where a limited number of iterations was placed upon a four domain problem.

Table 5.3: Evaluation of 64 Element Plate Limited to Two Subspace Iterations

Number of    Figure    SPEEDUP    EFFICIENCY    SPEEDUP    EFFICIENCY
Processors   Number    (q = 2)    (q = 2)       (q = 6)    (q = 6)
1            5.2       1.00       100%          1.00       100%
2            5.5       1.68       84%           1.75       88%
4            5.7       2.84       71%           2.91       73%
6            5.11      2.88       48%           3.00       50%
8            5.12      2.83       35%           3.18       40%

The six and eight processor models did not do as well as expected, unless the number of iterations was smaller in the parallel solution. Two reasons are given for their deficiency:

1. The machine was overloaded due to the high number of logical processors. Due to the problems with the SSD device, all tasks must carry the complete input data rather than just the domain data needed by the corresponding task. This problem is present in the two and four processor models but is not as evident.

2. The larger number of dof on the boundaries of the models; because of this the frontal technique is not being used to its fullest capacity.

5.4 Examination of Subroutines:

To start the multitasking package, a main program was developed to set events and map out all domain processors. The time taken to perform this task was calculated to be between 20 and 60 milliseconds, which does not have a big impact on the total execution time. As mentioned earlier in Section 3.2, at the first synchronization point all input data is read into task one and passed to the other tasks because of problems with I/O (only single-threaded I/O is available) on the Cray. This causes a 1.0 second delay until all processors can move forward again. The subroutine that handles the dynamic dimensioning of arrays has no speedup and takes approximately the same amount of time regardless of the number of processors, but since it takes less than one millisecond to perform all calculations, the subroutine is assumed negligible in the total execution time.
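For readers unfamiliar with the macrotasking calls referred to above, the fragment below is a minimal sketch of this kind of driver, written in free-form Fortran for readability. The routine name DOMAIN, the event name INEVT and the argument lists are illustrative assumptions and are not the actual FEDA code; the exact calling sequences for TSKSTART, TSKWAIT, EVASGN, EVPOST and EVWAIT are defined in the Cray multitasking manual (Cray Research, Inc., 1987).

    ! Schematic macrotasking driver: start one task per domain, guard the
    ! broadcast of input data with an event, then wait for all tasks.
    ! User routine and event names are assumptions for illustration only.
    program multi_sketch
      implicit none
      external :: DOMAIN                   ! per-domain driver routine (assumed name)
      integer, parameter :: NDOMS = 2      ! number of domains/tasks
      integer :: taskid(2, NDOMS)          ! one task control array per task
      integer :: inevt                     ! event: input data has been broadcast
      integer :: idom

      call EVASGN(inevt)                   ! create the event

      do idom = 1, NDOMS
         taskid(1, idom) = 2                              ! first word = length of control array
         call TSKSTART(taskid(:, idom), DOMAIN, idom)     ! spawn the domain task
      end do

      ! Inside DOMAIN, task 1 reads the input data and posts INEVT with EVPOST;
      ! the remaining tasks call EVWAIT(inevt) before using the shared data.

      do idom = 1, NDOMS
         call TSKWAIT(taskid(:, idom))     ! wait for every domain task to finish
      end do
    end program multi_sketch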

DMATRON, the subroutine that determines the first and last appearance of all nodes in its domain, has a very low speedup for all sizes of domains. This subroutine takes about 2% of the total execution time to complete; some overhead is accumulated in this subroutine but it is not critical to the total execution time. The creation of the element matrices, [K]e and [M]e, is the first place where significant speedup is achieved because the finite element model is substructured into an equal number of elements in each task; the individual tasks should have an ideal speedup of 2.0, 4.0, 6.0 and 8.0 for two, four, six and eight processors in subroutine ESTIFF. Referring to Table 5.4, the two, four and eight domain structures had efficiencies of 89%, 90% and 90% for subroutine ESTIFF, which means some overhead has accumulated at this point due to parallel processing. In the unbalanced six processor model (Fig. 5.11) the efficiency is only 80% as a result of the extra elements in domains one and six.

Table 5.4: Subroutine Speedups for Varying Domain Sizes with q = 6 (Refer to Table 5.3 for Figure Numbers and Section 3.5.2 for Subroutine Descriptions)

RELOAD     1.92     3.84     5.06     7.64

After the element matrices have been generated, the program is ready to begin the solution-resolution process to determine the natural frequencies of the system. The first and most critical subroutine is DFRONT, where the multi-frontal technique is implemented along with the assembly and elimination of the global front. The success of the parallel algorithm is dependent upon the number of dof on the global front; as the number of domains increases, so does the number of dof on the global front. This subroutine gets progressively worse as the number of dof on the global front and the number of domains increase, which can be seen in Table 5.4. The first iteration's execution time in DFRONT will always be greater than the remaining subspace iterations' execution times because the later iterations require a lower number of calculations. All equations needed in the resolve that were previously calculated are saved for future use, e.g. in the Gauss-Jordan method for the elimination of the global front, all variables divided by the pivot in their respective equations are saved for resolution.
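To make the save-for-resolution idea concrete, the sketch below shows one way to organize a Gauss-Jordan solve so that the pivots and multipliers computed during the first solution are stored and reused for later right-hand sides. It is a generic dense-matrix illustration of the technique described above, not the actual DFRONT code; names such as gj_factor and gj_resolve are invented for the example, and no pivoting is performed.

    module gj_resolve_mod
      implicit none
    contains
      ! Gauss-Jordan reduction without pivoting (assumes nonzero pivots).
      ! On exit, piv holds the pivots and mult(i,k) the multipliers, so that
      ! later right-hand sides can be resolved without refactorizing.
      subroutine gj_factor(a, piv, mult, n)
        integer, intent(in)    :: n
        real,    intent(inout) :: a(n, n)
        real,    intent(out)   :: piv(n), mult(n, n)
        integer :: i, k
        do k = 1, n
           piv(k) = a(k, k)
           a(k, :) = a(k, :) / piv(k)           ! divide pivot row by its pivot
           do i = 1, n
              if (i /= k) then
                 mult(i, k) = a(i, k)           ! save multiplier for resolution
                 a(i, :) = a(i, :) - mult(i, k) * a(k, :)
              end if
           end do
        end do
      end subroutine gj_factor

      ! Apply the stored operations to a new right-hand side; x overwrites b.
      subroutine gj_resolve(piv, mult, b, n)
        integer, intent(in)    :: n
        real,    intent(in)    :: piv(n), mult(n, n)
        real,    intent(inout) :: b(n)
        integer :: i, k
        do k = 1, n
           b(k) = b(k) / piv(k)
           do i = 1, n
              if (i /= k) b(i) = b(i) - mult(i, k) * b(k)
           end do
        end do
      end subroutine gj_resolve
    end module gj_resolve_mod

In a subspace-iteration setting the factorization step would be performed once, in the first iteration, and only the resolve step repeated for each new set of right-hand sides.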

A more comprehensive investigation will be conducted later in Section 5.7. The remaining two subroutines, DCONDS and RELOAD, perform the calculations of the modified subspace method. Some overhead accompanies these subroutines, but overall their speedups were consistent and they performed very well.

For the two processor model (Fig. 5.5) the percentage of total execution time used by the different subroutines in parallel FEDA is presented in Table 5.5. The sum of all values in Table 5.5 is 96%, which leaves 4% of the total execution time attributable to the overhead of parallel processing. In conclusion, a recap of the overhead associated with parallel FEDA is given:

1. extra coding to implement the parallel processing.

2. extra storage requirements used by the separate task processors.

3. input of data and the map of the pre-front needed in the solution.

4. communication links used to pass information.

5. assembly and elimination of the global front performed within each task.

6. calculations that are not performed on an element by element basis.

Table 5.5: Percentage of Execution Time taken by Parallel FEDA for the Two Domain Model in Fig. 5.5

5.5 Impact of Increasing Elements:

The effect of increasing the number of elements and degrees of freedom (dof) of the overall plate will be investigated. All other factors are kept constant, i.e. q = 6, two or six subspace iterations, the plate always divided into two balanced domains, 13 dof on the global front, the direction the wave front sweeps across the domain as shown in Fig. 5.14, and the size of the wave front. Plates consisting of 8, 16, 24, 32, 40 and 64 elements (Figs. 5.1 and 5.2) were used in this test.

Figure 5.14: Subdivided Plate for all Test Runs with the Direction of the Local Fronts Shown

Referring to Fig. 5.15, the speedup increases at a constant rate for both the two and six subspace iteration test cases as the number of elements increases. These results show that for large finite element problems a significant increase in computational speedup can be achieved if the number of elements per domain is large enough to overcome the deficiency of eliminating the global front. The percentage of execution time taken by DFRONT becomes less of a factor as the number of iterations and elements increase. The 8 and 16 element plates for two subspace iterations are affected by the single-threaded I/O more than the larger size plates.

Figure 5.15: Effect of Increasing the Number of Elements on the Execution Time with all Other Factors Constant

Looking at the execution time shows a vast improvement in the parallel algorithm with the added elements. However, another determinant to be evaluated is the work done, or CPU time, for the solution process; the speedup for the work done is displayed in Fig. 5.16. The ideal speedup would be 1.00 for the parallel solution, i.e. the amount of work done would be equal for both the parallel and sequential solutions. Although the work done by the parallel solution is greater than that of the sequential solution in all cases, the parallel CPU time approaches the sequential CPU time at a steady pace as the number of elements is expanded. This is a very notable factor when applying the parallel algorithm to enormous finite element models because the work done by the parallel solution is not increasing as rapidly as that of the sequential solution. Therefore, the overhead associated with parallel processing becomes less noticeable as the number of elements increases, with all major factors that affect the solution held constant.

Figure 5.16: CPU Time Affected by Extra Elements with all Major Factors Constant and a Limit of Two Subspace Iterations

5.6 Subspace Dimension:

In this study the number of eigenvalues and mode shapes will be increased to determine the impact on the algorithm. All factors are kept constant when using the 64 element plate shown in Figs. 5.2 and 5.5, with test runs limited to two subspace iterations. Displayed in Fig. 5.17 is the speedup relative to the increasing number of eigenvalues (q = 2, 4, 6, 8 and 10); the speedup in execution time shows a steady increase, from 1.68 to 1.75, for the largest two subspace dimensions. In conclusion, for large finite element problems increasing the subspace size adds no extra overhead and shows a steady increase in speedup for a higher number of eigenpairs.

Determining the effect of extra eigenpairs in the subspace on the CPU time is the next goal. Referring to Fig. 5.18, the speedup for the CPU time slowly builds up as the eigenvalues expand in number. Consequently, this leads to the same conclusion found in the previous section, i.e. the overhead associated with parallel processing becomes less noticeable as the number of eigenvalues and mode shapes increases while keeping all major factors constant.

Figure 5.17: Impact on Execution Time with an Increasing Subspace Dimension with all Other Factors Constant

Figure 5.18: Influence of Increased Eigenpairs on the CPU Time Keeping all Factors Constant

5.7 Size of Global Front:

A primary section of additional coding for parallel FEDA is associated with the assembly and elimination of the global front, and a major portion of the overhead is accumulated in this section of the domain task. Keeping the degrees of freedom on the global front to the lowest possible total by subdividing the finite element model appropriately will increase the speedup and efficiency of the overall problem. When the number of domains increases, so too do the dof on the global front, and the best one can hope for is an equivalent number of dof on the boundaries when the finite element problem is subdivided into a greater number of domains compared to the simplest two domain substructure. This has proven to be a deficiency in parallel FEDA, which can be seen in Table 5.6, where the global front varies for different test cases. For example, in the two domain problem (Fig. 5.5) with 13 dof on the global front and 115 dof remaining in each domain, subroutine DFRONT in the first and second iterations takes up 19% of the total execution time. In contrast, the eight domain problem with 91 dof on the global front and 19 dof remaining in each domain takes 64% of the total execution time.

The four domain models shown in Figs. 5.7-5.9 will be investigated to determine the impact of increasing the dof on the global front with all other factors constant, i.e. q = 6 and two subspace iterations. In Fig. 5.19 the results of the four domain models listed in Table 5.6 are plotted. The degrees of freedom in one domain are the sum of the global front dof and the dof remaining in one domain, e.g. in Table 5.6 for Fig. 5.7, the dof for each separate domain is 51 and the dof on the global front is 39, with a total dof in one domain equaling 89 for this test case; the other totals are 114 dof for Fig. 5.8 and 198 dof for Fig. 5.9. Clearly shown in Fig. 5.19 is the impact of increasing the global front in parallel FEDA: a significant loss in speedup is witnessed as the dof on the global front increase and the remaining dof in each domain decrease.

Table 5.6: Impact of Degrees of Freedom on Global Front for q = 6 and Two Subspace Iterations

The value in [ ] corresponds to the first or second iteration.

In Figs. 5.7 and 5.9, the problem of an idle domain, or waiting by one task, will arise in DFRONT even though all domains have an equal number of elements, as a consequence of dissimilar domains, e.g. in domains two and three a local front is located on each side of the domain, whereas in domains one and four only one local front exists. This will cause unbalancing in the processors' work load because of an increased frontwidth in domains two and three. The domains cannot continue execution until all domains post their event in DFRONT; therefore a domain may sit idle and cause overhead in the parallel solution. For Fig. 5.7, it was found that DFRONT had a 0.4 second difference between the slowest (domain 2) and fastest (domain 1) domains. Since domain 2 has a larger wave front due to the local fronts on either side of the domain, the wave front will sweep across the domain more slowly than in domain 1, which has only one local front.

Figure 5.19: Impact of Increasing the Global Front on the Different Four Domain Models (Figs. 5.7-5.9)

Chapter 6

CONCLUSIONS AND RECOMMENDATIONS

The frontal solution (Irons, 1970) and modified subspace method (Akl, et al., 1979) were presented with their advantages. Parallel implementation of these methods on the Cray X-MP/24 proved successful in achieving computational speedup. In addition, the use of multitasking routines (Cray Research, Inc., 1987) installed on the Cray X-MP computer has proven to be very efficient in mapping each domain to a user task.

The parallel program described in this thesis was found to be an accurate and effective algorithm to solve large linear finite element eigenproblems on the Cray X-MP/24 computer.

The parallel eigensolver demonstrates that speedups in execution time can be achieved compared to a similar sequential algorithm (Figs. 6.1 and 6.2). Utilization of the Cray's multitasking library also proved to be an efficient tool to parallelize the FORTRAN code.

Multitasking subroutines were found to be of minimal impact on the total execution time.

The parallel program takes advantage of the shared and local memory on the MIMD Cray machine while successfully using a completely connected architecture to transmit information between the domains.

Figure 6.2: Evaluation of 64 Element Plate on Parallel FEDA for q = 6 and Six Subspace Iterations

The extra calculations performed by each task to handle the global front lowered the speedup and efficiency significantly. On the other hand, the performance for the creation of the stiffness and mass matrices and for the modified subspace method was extremely encouraging and indicates the effectiveness of multitasking on the Cray X-MP computer. Therefore, when calculations were performed on an element by element basis, in parallel, speedup and efficiency were very high compared to the section where extra sequential coding was required.

When subdividing a finite element model into N domains, one should choose the configuration with the lowest possible dof on the global front, for this will increase the speedup and efficiency of the parallel algorithm. In addition, load balancing, i.e. assigning an equivalent amount of work to each task by keeping the number of elements and the frontwidth equal in all domains, is very important to the performance of parallel FEDA. Communication links were found to be of minimal impact (no idle domains) in parallel processing if all domains were balanced. Some overhead is accumulated due to one processor being faster than another, but the overall execution time of the problem will overshadow this deficiency. Based on the results obtained, the author recommends that one should further subdivide the finite element model into a greater number of domains only if the number of elements in the separate domains is large enough to overcome the overhead associated with the global front elimination. As a result of the greater number of elements per domain, the amount of work performed by the individual tasks in all subroutines will outweigh subroutine DFRONT in the total execution time. Element numbering is an important aspect in lowering the frontwidth for the frontal technique. The user should always number the elements in the domain so that the wave front converges to the local front, if at all possible. In addition, to optimize element numbering, one should follow the general rules that apply to optimum node numbering (Bathe, 1982).

Reported earlier were the two types of multitasking, macrotasking and microtasking, with macrotasking being used herein. Parallelization of the solution of the linear simultaneous equations on the global front using microtasking routines (Cray Research, Inc., 1987) could prove to be beneficial, because all tasks perform the same calculations on the global front stiffness matrix. These calculations could be performed concurrently to increase the speedup of parallel FEDA, especially for the larger size domains, by using a parallel direct method (Leuze, 1984, McGregor and Salama, 1983, Salama, et al., 1983 and Sameh and Brent, 1977).

REFERENCES

Akl, F. A., (1978)
A Modified Subspace Algorithm for Dynamic Analysis by Finite Elements, and Vibration Control of Structures, Ph.D. Thesis, Department of Civil Engineering, The University of Calgary, Canada, December 1978.

Akl, F. A., Dilger, W. H. and Irons, B. M., (1979)
"FEDA" A Finite Element Dynamic Analysis Program, Department of Civil Engineering, The University of Calgary, Canada.

Akl, F. A., (1983)
"Convergence of Modified Subspace Eigenanalysis Algorithm," Proceedings of the 8th Conference on Electronic Computation, ASCE, Houston, TX, pp. 416-421, February 1983.

Akl, F. A. and Hackett, R., (1986)
"Multi-Frontal Algorithm for Parallel Processing of Large Eigenproblems," Proceedings of the 27th SDM Conference, Paper 86-0929, pp. 395-399.

Adams, L. and Voigt, R. G., (1983)
A Methodology for Exploiting Parallelism in the Finite Element Process, NASA CR-172219, September 1983.

Arora, J. S. and Nguyen, D. T., (1980)
"Eigensolution for Large Structural Systems with Substructures," International Journal for Numerical Methods in Engineering 15, pp. 333-341.

Baldwin, J. T., Razzaque, A. and Irons, B. M., (1973)
"Shape Function Subroutine for an Isoparametric Thin Plate Element," International Journal for Numerical Methods in Engineering 7, pp. 431-440.

Bathe, K. J. and Wilson, E. L., (1973)
"Solution Methods for Eigenvalue Problems in Structural Mechanics," International Journal for Numerical Methods in Engineering 6, pp. 213-226.

Bathe, K. J., (1982)
Finite Element Procedures in Engineering Analysis, Prentice-Hall, New York, NY.

Beaufait, F. W., Rowan, W. H., Jr., Hoadley, P. G. and Hackett, R. M., (1975)
Computer Methods of Structural Analysis, Vanderbilt University, Department of Civil Engineering, Nashville, Tennessee.

Bostic, S. and Fulton, R., (1985)
"A Concurrent Processing Implementation for Structural Vibration Analysis," AIAA Paper 85-0783, AIAA/ASME/ASCE/AHS 26th Structures, Structural Dynamics and Materials Conference, Orlando, FL, April 1985.

Bostic, S. and Fulton, R., (1986)
"Implementation of the Lanczos Method for Structural Vibration Analysis on a Parallel Computer," AIAA Paper 86-0930, AIAA/ASME/ASCE/AHS 27th Structures, Structural Dynamics and Materials Conference, San Antonio, TX, May 1986.

Bostic, S. and Fulton, R., (1987)
"A Lanczos Eigenvalue Method on a Parallel Computer," AIAA Paper 87-0725, AIAA/ASME/ASCE/AHS 28th Structures, Structural Dynamics and Materials Conference, Monterey, CA.

Cray Research, Inc., (1987)
Cray X-MP Multitasking Programmer's Reference Manual, SR-0222, July 1987.

Farhat, C. and Wilson, E., (1987)
"A New Finite Element Concurrent Computer Program Architecture," International Journal for Numerical Methods in Engineering 24, pp. 1771-1792.

Hwang, K. and Briggs, F. A., (1984)
Computer Architecture and Parallel Processing, McGraw-Hill, Inc., New York, NY.

Irons, B. M., (1970)
"A Frontal Solution Program for Finite Element Analysis," International Journal for Numerical Methods in Engineering 2, pp. 5-32.

Irons, B. M. and Ahmad, S., (1984)
Techniques of Finite Elements, John Wiley and Sons, New York, NY.

Leuze, M. R., (1984)
Parallel Triangularization of Substructured Finite Element Problems, NASA CR-172466, September 1984.

McCormick, C. W., ed., (1982)
MSC/NASTRAN User's Manual, The MacNeal-Schwendler Corporation, Los Angeles, CA, April 1982.

McGregor, J. and Salama, M., (1983)
"Finite Element Computation with Parallel VLSI," Proceedings of the 8th Conference on Electronic Computation, ASCE, Houston, TX, pp. 540-553, February 1983.

Melhem, R. G., (1985)
A Modified Frontal Technique Suitable for Parallel Systems, Technical Report ICMA-85-84, July 1985.

Melosh, R. J. and Bamford, R. M., (1969)
"Efficient Solution of Load-Deflection Equations," Journal of the American Society of Civil Engineers, Structural Division, Paper No. 5610, pp. 661-676.

Paz, M., (1985)
Structural Dynamics: Theory and Computation, Van Nostrand Reinhold Company, New York, NY.

Quinn, M. J., (1987)
Designing Efficient Algorithms for Parallel Computers, McGraw-Hill Book Company, Inc., New York, NY.

Salama, S., Utku, S. and Melosh, R., (1983)
"Parallel Solution of Finite Element Equations," Proceedings of the 8th Conference on Electronic Computation, ASCE, Houston, TX, pp. 526-539, February 1983.

Sameh, A. and Brent, R. P., (1977)
"Solving Triangular Systems on a Parallel Computer," SIAM Journal on Numerical Analysis 14, December 1977.

Schaeffer, H. G., (1982)
MSC/NASTRAN Primer: Static and Normal Modes Analysis, Wallace Press Inc., Milford, New Hampshire.

Schendel, U., (1984)
Introduction to Numerical Methods for Parallel Computations, John Wiley and Sons, New York, NY.

Storaasli, O., Peebles, S., Crockett, T., Knott, J. and Adams, L., (1982)
The Finite Element Machine: An Experiment in Parallel Processing, NASA TM-84514, July 1982.

Storaasli, O., Ransom, J. and Fulton, R., (1984)
"Structural Dynamic Analysis on a Parallel Computer: The Finite Element Machine," AIAA Paper 84-0966, AIAA/ASME/ASCE/AHS 25th Structures, Structural Dynamics and Materials Conference, Palm Springs, CA, May 1984.

Storaasli, O., Bostic, S., Patrick, M., Mahajan, U. and Shing, M., (1988)
"Three Parallel Computation Methods for Structural Vibration Analysis," AIAA Paper 88-2391, AIAA/ASME/ASCE/AHS 29th Structures, Structural Dynamics and Materials Conference, Williamsburg, VA, pp. 1401-1411, April 1988.

Weaver, W., Jr. and Johnston, P. R., (1987)
Structural Dynamics by Finite Elements, Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

Appendix A

Data Input for Parallel FEDA

DESCRIPTION                                                        VARIABLE      COLUMN

1st line (Format 20I4)

  Total number of nodes.                                           NPOIN
  Total number of nodes along global front.                        NGLOBE
  Total number of elements.                                        NEL
  Maximum number of nodes per element.                             NODEL
  Maximum degrees of freedom per node.                             NDFMAX
  Maximum number of solutions or iterations in eigen analysis.     NCASE
  Number of dimensions (2 or 3).                                   NDIM
  Number of stresses = size of matrix D.                           NSTRES
  Number of element types.                                         NTYPE
  Gauss rule used.                                                 NGAUS
  Number of properties, i.e., E, A, etc.                           IPROP
  Number of sets of properties.                                    JPROP
  Number of nodes with prescribed values.                          NFIX
  Number of nodes with additional stiffnesses.                     NEXTIF
  Number of nodes with additional loads.                           NLOAD
  Maximum number of variables per element.                         LVAB
  Number of right hand sides or eigenpairs.                        NRHS
  Number of nodes in vibration mode problem.                       NPOIN1
  Printing option.                                                 NPTION
  Solution option.                                                 KODSOL
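As an illustration of how this first control line maps onto a Fortran read, the fragment below is a minimal free-form sketch (not taken from FEDA) that reads the twenty control integers with the 20I4 format; the unit number, file name and program name are assumptions.

    ! Illustrative reader for the first control line of the data deck (20I4).
    program read_control_line
      implicit none
      integer :: npoin, nglobe, nel, nodel, ndfmax, ncase, ndim, nstres, &
                 ntype, ngaus, iprop, jprop, nfix, nextif, nload, lvab,  &
                 nrhs, npoin1, nption, kodsol

      open (unit=10, file='feda.dat', status='old')   ! assumed file name
      read (10, '(20I4)') npoin, nglobe, nel, nodel, ndfmax, ncase, ndim,  &
                          nstres, ntype, ngaus, iprop, jprop, nfix, nextif, &
                          nload, lvab, nrhs, npoin1, nption, kodsol
      close (10)

      print *, 'nodes =', npoin, '  elements =', nel, '  rhs =', nrhs
    end program read_control_line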

2nd line (Format 20I4)

  Number of sets of element loads.                                 JLOAD
  Number of fronts/domains.                                        NFRONT
  Maximum number of nodes per global front.                        GNODEL

3rd line (Format E10.2)

  Tolerance level in eigen analysis.

4th set of lines (Format 20I4) [Enter values for each set of properties]

  Number of degrees of freedom per node.                           NDF(N,J)      1-4, etc.
  where N = 1,NODEL and J = 1,JPROP
  - Total number of lines = JPROP

5th set of lines (Format 20I4) [Enter values for each element]

  Element number.                                                  NE            1-4
  Type of element.                                                 LTYPE(NE)     5-8
  Property type of element.                                        LPROP(NE)     9-12
  Load type.                                                       LLOAD(NE)     13-16
  Domain/front the element is located on.                          LDMAIN(NE)    17-20
  Element node numbers.                                            LNODS(N,NE)   21-24, etc.
  where N = 1,NODEL and NE = 1,NEL
  - Total number of lines = NEL

6th set of lines (Format I5,3F11.4) [Enter values for each node]

  Node number.                                                     NODES         1-5
  Coordinates of node (X,Y,Z).                                     COORD(N,I)    6-16, etc.
  where I = 1,NDIM and N = 1,NPOIN
  - Total number of lines = NPOIN

7th set of lines (Format 2I6,6F11.7) [Enter values for each node that is fixed]

  Node number that is fixed.                                       NODFIX(N)     1-6
  Code number listing which degree of freedom is prescribed
  at each node.                                                    KODFIX(N)     7-12
  Prescribed displacements at the nodes listed in NODFIX.          VFIX(N,I)     13-23, etc.
  where I = 1,NDFMAX and N = 1,NFIX
  - Total number of lines = NFIX

8th set of lines (Format I5,6E12.5) [Enter a value for each node with additional stiffness]

  Node number with additional stiffness.                           NOSTIF(N)     1-5
  Additional stiffnesses at the nodes.                             VSTIF(N,I)    6-17, etc.
  where I = 1,NDFMAX and N = 1,NEXTIF
  - Total number of lines = NEXTIF

9th set of lines (Format I5,6E12.5) [Enter values for each node with additional loads]

  Node number with additional loads.                               NODLOD(N)     1-5
  Additional loads at the nodes.                                   VLOAD(N,I,J)  6-17, etc.
  where I = 1,NDFMAX, N = 1,NLOAD and J = 1,NRHS
  - Total number of lines = NLOAD, repeated NRHS times.

10th set of lines (Format I5,6E12.5) [Enter values for each set of properties]

  Property set number.                                             N             1-5
  Property values for the specific element being used.             VPROP(I,N)    6-16, etc.
  where I = 1,IPROP and J = 1,JPROP
  - Total number of values = IPROP
  - Total number of lines = JPROP

11th set of lines (Format I5,6E12.5) [Enter values for each node]

  Node number.                                                     N             1-5
  Load set.                                                        SLOAD(I,N)    6-17, etc.
  where NR = 1,NRHS and I = 1,JLOAD
  - Total number of lines = JLOAD

12th set of lines (Format 20I4) [Enter each node on global front]

  Node number on global front.                                     NDMAIN(I)     1-4, etc.
  where I = 1,NGLOBE
  - Total number of values = NGLOBE

Appendix B

Output of Parallel FEDA

An output file containing the input data and eigenvalues is presented for a two domain 64 element plate shown in Fig. B.1.

Figure B.1: A Two Domain 64 Element Plate with 13 DOF on the Global Front

*  F. FRED AKL AND M. MOREL  *

PROBLEM DESCRIPTION

MAXIMUM NODE NUMBER = NPOIN = 233
NUMBER OF NODES ALONG BOUNDRY = NGLOBE = 9
NUMBER OF ELEMENTS = NEL = 64
MAXIMUM NODES PER ELEMENT = NODEL = 8
MAXIMUM DEGREES OF FREEDOM PER NODE = NDFMAX = 3
NUMBER OF SOLUTIONS REQUIRED = NCASE = 100
NUMBER OF DIMENSIONS, 2 OR 3, = 3
NUMBER OF STRESSES = SIZE OF MATRIX D = NSTRES = 3
NUMBER OF ELEMENT TYPES = NTYPE = 1
GAUSS RULE USED = NGAUS = 2 POINTS
NUMBER OF PROPERTIES: E, A, I, RHO, THICK = IPROP = 5
NUMBER OF SETS OF PROPERTIES AVAILABLE = JPROP = 1
NUMBER OF NODES WITH SOME FIXED VALUES = NFIX = 80
NUMBER OF NODES WITH ADDITIONAL STIFFNESSES = NEXTIF = 0
NUMBER OF NODES WITH ADDITIONAL LOADS = NLOAD = 0
MAXIMUM VARIABLES PER ELEMENT = LVAB = 16
NUMBER OF RIGHT HAND SIDES = NRHS = 8
NUMBER OF INTEGER WORDS PER FLOATING WORD = INTEG = 1
LENGTH OF VECTOR OF FLOATING WORDS = LENVEC = 200000
LENGTH OF VECTOR OF FLOATING WORDS = LNVEC1 = 100000
NUMBER OF SETS OF ELEMENT LOADS = JLOAD = 0
NUMBER OF NODES IN VIBRATION MODE PROFILES = NPOIN1 = 0
PRINTING OPTION = NPTION = 1
SOLUTION CODE = KODSOL = -3
  KODSOL =  0 IE. STATIC SOLUTION
  KODSOL = -1 IE. MODAL DYNAMIC ANALYSIS
  KODSOL = -2 IE. DIRECT DYNAMIC ANALYSIS
  KODSOL = -3 IE. SUBSPACE EIGEN ANALYSIS
TOLERENCE LEVEL IN EIGEN ANALYSIS = TOL2 = 0.10000E-06
NUMBER OF FRONTS = NFRONT = 2
MAXIMUM NODES PER GLOBAL FRONT = GNODEL = 9
DEGREES OF FREEDOM AT NODES OF ELEMENT OF TYPE 1 = NDF = 3 1 3 1 3 1 3 1

ELEMENT  TYPE  PROPERTY  LOAD  DOMAIN  NODE NUMBERS = LNODS
(element connectivity listing for elements 1 through 64)

NOD  COORDINATES = COORD
(nodal coordinate listing for nodes 1 through 233)

NODE  FIXING CODE  FIXED VALUES = NODFIX = KODFIX = VFIX
191  111  0.0000000  0.0000000  0.0000000
196    1  0.0000000  0.0000000  0.0000000
205  111  0.0000000  0.0000000  0.0000000
210    1  0.0000000  0.0000000  0.0000000
219  111  0.0000000  0.0000000  0.0000000
224    1  0.0000000  0.0000000  0.0000000

NUMBER  ELEMENT PROPERTIES = VPROP
1  0.100000E+01  0.100000E+01  0.300000E+00  0.800000E+01  0.100000E+01

LIST OF NODES LOCATED ALONG GLOBAL FRONT
113  114  115  116  117  118  119  120  121
  3    1    3    1    3    1    3    1    3

TOL2 HAS BEEN ACHIEVED
CURRENT TOLERENCE LEVELS
-0.4739E-14  0.0000E+00  0.0000E+00  -0.1651E-13  0.1626E-13  -0.1164E-13  0.4111E-14  0.7575E-07

NUMBER OF SUBSPACE ITERATIONS = 23

EIGENVALUES ARE
0.1171E-01  0.1306E-01  0.1569E-01  0.2017E-01  0.2731E-01  0.3814E-01  0.5401E-01  0.7662E-01