Automated computational modelling for complicated partial differential equations
Thesis

submitted for the degree of doctor at the Technische Universiteit Delft, by the authority of the Rector Magnificus prof. ir. K.C.A.M. Luyben, chairman of the Board for Doctorates, to be defended in public on Tuesday 3 December 2013 at 12:30

by

Kristian Breum ØLGAARD
Master of Science in Civil Engineering, Aalborg Universitet Esbjerg
born in Ringkøbing, Denmark

This thesis has been approved by the promotor: Prof. dr. ir. L. J. Sluys
Copromotor: Dr. G. N. Wells

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. dr. ir. L. J. Sluys, Technische Universiteit Delft, promotor
Dr. G. N. Wells, University of Cambridge, copromotor
Dr. ir. M. B. van Gijzen, Technische Universiteit Delft
Prof. dr. P. H. J. Kelly, Imperial College London
Prof. dr. R. Larsson, Chalmers University of Technology
Prof. dr. L. R. Scott, University of Chicago
Prof. dr. ir. C. Vuik, Technische Universiteit Delft
Prof. dr. A. Scarpas, Technische Universiteit Delft, reserve member
Copyright © 2013 by K. B. Ølgaard
Printed by Ipskamp Drukkers B.V., Enschede, The Netherlands
ISBN 978-94-6191-990-8

Foreword
This thesis represents the formal end of my long and interesting journey as a PhD student. The sum of many experiences over the past years has increased my knowledge and contributed to my personal development. All these experiences originate from the interaction with many people to whom I would like to express my gratitude.

I am most grateful to Garth Wells for giving me the opportunity to come to Delft and to study under his competent supervision. His constructive criticism and vision combined with our nice discussions greatly improved the quality of my research. As the head of the computational mechanics group, Bert Sluys has played a vital role by creating a very nice and supportive working environment where people enjoy a lot of creative freedom. As creativity is key in this research I consider myself lucky to have been part of Bert’s group.

Ronnie Pedersen did a very good job in persuading me to come to Delft for a PhD, and I am happy that he managed to convince me. I am also grateful for enjoying his friendship throughout the years, the good times on the football pitch, and the even better times in ’t Proeflokaal watching football and discussing work and life in general.

A friendly and inspiring working environment is important in order to produce quality work. Therefore, I would like to thank past and present colleagues Rafid Al-Khoury, Roberta Bellodi, Frank Custers, Frank Everdij, Huan He, Cecilia Iacono, Cor Kasbergen, Oriol Lloberas-Valls, Prithvi Mandapalli, Frans van der Meer, Andrei Metrikine, Peter Moonen, Dung Nguyen, Vinh Phu Nguyen, Mehdi Nikbakth, Marjon van der Perk, Frank Radtke, Zahid Shabir, Xuming Shan, Angelo Simone, Mojtaba Talebian, Andy Terrel, Ilse Vegt, Jaap Weerheijm, Sigurd Blöndal, Lars Damkilde, Niels Dollerup, Jens Hagelskjær, Michael Jepsen, Sven Krabbenhøft and Søren Lambertsen. In particular, I would like to thank Frans for the years that we shared the same office and for translating the propositions into Dutch.
A special thanks goes to Mehdi, my ‘brother-in-arms’, the only person remaining in the group who was also involved with the FEniCS Project after Garth left for Cambridge and Xuming left for home.

The research presented in this thesis is centered around the FEniCS Project and, therefore, I would also like to thank all the people in the FEniCS community, in particular my close collaborators from Simula, Anders Logg, Martin Alnæs, Marie Rognes and Johan Hake, for all the nice discussions, debugging assistance and good ideas. During my PhD, I also had the pleasure of visiting the University of Michigan and in this regard I want to thank Krishna Garikipati, Jake Ostien and his wife Erin for their hospitality during my stay in Ann Arbor.

Outside the office, I enjoyed many hours in the good company of my friends Linda Grimstrup and Lars Freising, which definitely improved the quality of my social life a lot. I also want to thank all my former team mates at Vitesse Delft for the many memorable hours on the football pitch trying to learn the secrets behind ‘totaalvoetbal’. Although The Netherlands and Denmark are quite similar in terms of weather, nature and culture, it was always nice to receive visitors from home. For this, I would like to thank my friends Kenneth Guldager, Henrik Hansen, Mads Madsen, Christian Meyer, Nick Nørreby and Thomas Sørensen.

Last, but certainly not least, I want to thank my parents and my brother and sisters for their encouragement, support, help and visits during my years in Delft. I also wish to thank both of my sons for putting things in perspective, which helped me to focus during the last iterations towards finishing this thesis. Of all people, I am most grateful to my wife. I know her patience has been tested to the limit, yet she remained supportive, loving and caring during all the years. For this, and for our sons, I am forever indebted.

The research presented in this thesis was carried out at the Faculty of Civil Engineering and Geosciences at Delft University of Technology. The research was supported by the Netherlands Technology Foundation STW, the Netherlands Organisation for Scientific Research and the Ministry of Public Works and Water Management.
Kristian Breum Ølgaard
Ølgod, Denmark, November 2013

Contents
1 Introduction
    1.1 Research objectives and approach
    1.2 Outline
    1.3 The FEniCS Project
        1.3.1 Simple model problem
        1.3.2 Unified Form Language
        1.3.3 FEniCS Form Compiler
        1.3.4 Unified Form-assembly Code
        1.3.5 DOLFIN

2 FEniCS applications to solid mechanics
    2.1 Governing equations
        2.1.1 Preliminaries
        2.1.2 Balance of momentum
        2.1.3 Potential energy minimisation
    2.2 Constitutive models
        2.2.1 Linearised elasticity
        2.2.2 Flow theory of plasticity
        2.2.3 Hyperelasticity
    2.3 Linearisation issues for complex constitutive models
        2.3.1 Consistency of linearisation
        2.3.2 Quadrature elements
    2.4 Implementations and examples
        2.4.1 Linearised elasticity
        2.4.2 Plasticity
        2.4.3 Hyperelasticity
        2.4.4 Elastodynamics
    2.5 Current and future developments

3 Representations and optimisations of finite element variational forms
    3.1 Motivation and approach
    3.2 Representation of finite element tensors
        3.2.1 Quadrature representation
        3.2.2 Tensor contraction representation
    3.3 Quadrature optimisations
        3.3.1 Eliminate operations on zeros
        3.3.2 Simplify expressions
        3.3.3 Precompute integration point constants
        3.3.4 Precompute basis constants
        3.3.5 Further optimisations
    3.4 Performance comparisons of representations
        3.4.1 Performance for a selection of forms
        3.4.2 Performance for common, simple forms
        3.4.3 Performance for forms of increasing complexity
    3.5 Performance comparisons of quadrature optimisations
    3.6 Automatic selection of representation
    3.7 Future optimisations

4 Automation of discontinuous Galerkin methods
    4.1 Extending the framework to discontinuous Galerkin methods
        4.1.1 Extending the Unified Form Language
        4.1.2 Extending the Unified Form-assembly Code
        4.1.3 Extending the FEniCS Form Compiler
        4.1.4 Extending DOLFIN
    4.2 Examples
        4.2.1 The Poisson equation
        4.2.2 Steady state advection–diffusion equation
        4.2.3 The Stokes equations
        4.2.4 Biharmonic equation
        4.2.5 Further applications

5 Automation of lifting-type discontinuous Galerkin methods
    5.1 Lifting-type formulation for the Poisson equation
    5.2 Semi-automated implementation of lifting-type formulations
    5.3 Comparison of IP and lifting-type formulations
    5.4 Future developments

6 Strain gradient plasticity
    6.1 A strain gradient plasticity model
    6.2 A discontinuous Galerkin formulation for the plastic multiplier
    6.3 Linearisation of the governing equations
    6.4 Implementation
        6.4.1 The predictor step
        6.4.2 The corrector step
        6.4.3 Implementing the variational forms
    6.5 Numerical examples
        6.5.1 Unit square loaded in shear with strain softening
        6.5.2 Plate under compressive loading with strain softening
        6.5.3 Plate under compressive loading with strain hardening
        6.5.4 Micro-indentation
        6.5.5 Computational notes

7 Conclusions and future developments

References

Summary

Samenvatting

Propositions

Stellingen

Curriculum vitae
1 Introduction
Since the advent of the modern programmable computer in the 1940s, the cost of computing power relative to manpower has decreased significantly. As a consequence, high-level programming languages have emerged allowing the implementation of programs in source code using abstractions that are independent of the specific computer architectures on which the program is intended to run. A compiler is then invoked to translate the source code into machine code targeted for the given computer’s central processing unit (CPU). This development has allowed, among other things, researchers and scientists to write programs for investigating and solving various classes of problems numerically.

In engineering, physical phenomena are often described mathematically by partial differential equations (PDEs), and a commonly used method to solve these equations is the finite element method (FEM). Standard finite element software typically provides a problem solving environment for a set of engineering problems using a predefined selection of finite elements. As part of the application programming interface (API) a user can often supply subroutines which implement special methods, for instance, the constitutive model in case of a solid mechanics problem. This offers a degree of customisation and flexibility in terms of implementing certain models, but the approach may fall short as the complexity of a model increases. Strain gradient plasticity is an example of a class of models which can be difficult to implement in traditional finite element software, and researchers often resort to implementing their own unique solver targeting a specific model. An implementation involves translating the abstract mathematical representation of the model into source code which can be handled by a compiler, a process which can be tedious, time consuming and error prone.
However, by introducing a higher level of abstraction, the burden of this process can be alleviated when it comes to implementing mathematical representations of the FEM for solving PDEs. A possible abstraction consists of a form language for expressing the mathematical formulation of the given problem, and compilers which automatically generate efficient source code from the given mathematical expressions. This thesis is centered around this type of automated mathematical modelling.
1.1 Research objectives and approach
The research presented in this thesis aims at developing concepts, tools and methods which allow researchers and application developers to create efficient solvers for complicated partial differential equations with relatively little effort. Several software projects aim at providing a flexible framework for solving partial differential equations using the finite element method. These software projects include, among others, traditional finite element libraries and toolboxes such as deal.II (http://www.dealii.org/, Bangerth et al. (2007)), Diffpack (http://www.diffpack.com/, Langtangen (1999)), DUNE (http://www.dune-project.org, Bastian et al. (2008b,a)), GetFEM++ (http://home.gna.org/getfem/), OpenFOAM (http://www.openfoam.com/) and Cactus (http://cactuscode.org/, Allen et al. (2000)). However, some ‘hand coding’ is often needed in order to use the above-mentioned software. For instance, a user must typically implement (parts of) the assembly algorithm, which becomes cumbersome as the complexity of the problem increases.

A number of software projects have, therefore, emerged that try to automate the finite element method. These projects include, among others, FINGER (Wang, 1986), Archimedes (Shewchuk and Ghattas, 1993), Symbolic Mechanics System (Korelc, 1997), GetDP (http://geuz.org/getdp/, Dular et al. (1998)), FreeFEM++ (http://www.freefem.org/), Sundance (http://www.math.ttu.edu/~kelong/Sundance/html/, Long et al. (2010)), Feel++ (http://www.feelpp.org/, Prud’homme (2006)) and the FEniCS Project (http://fenicsproject.org, Logg et al. (2012a)). A common feature of these approaches is that they provide a higher level of abstraction for expressing variational forms and thereby lessen the burden on application developers. The developments presented in this thesis are implemented in various software components of the FEniCS Project, which is chosen for a number of reasons.
The software is released under an open source license¹ which makes it possible to obtain and modify the source code. This provides a high degree of freedom and flexibility in terms of implementing advanced models and applications. Furthermore, if the application source code is published, the implementation becomes completely transparent and reproducible, both properties of importance in research.

¹ All FEniCS core components are licensed under the GNU LGPL version 3 (or any later version) as published by the Free Software Foundation (http://www.fsf.org).

The software contains a problem solving environment that handles the assembly, the application of boundary conditions and the solution of sparse systems of equations. What distinguishes the software from the more conventional finite element packages is that it provides a high degree of mathematical abstraction by implementing a form language for expressing variational forms and relies on form compilers to automatically generate computer code for the local finite element tensor. This approach offers several advantages of which two are of particular interest. Firstly, the time needed to implement, test and debug the code for the local finite element tensor can be reduced. Secondly, various optimisations can be employed by the form compilers during the code generation stage to make the generated code competitive with hand optimised code. The importance of these two advantages grows with the complexity of the variational form. Finally, the software is under active development by a growing community, which is helpful for receiving feedback when implementing new features and during debugging sessions.

The potential of the FEniCS framework is evident; however, at the time when this work was commenced, the functionality in the FEniCS software was only available for a limited class of problems. For instance, only integration over element interiors was supported, which precluded, among other things, discontinuous Galerkin methods from being handled as these methods involve integration over element boundaries. Furthermore, problems like conventional plasticity were not possible to solve because the software could only handle functions coming from a finite element space. Also, the generated code was only efficient for limited classes of problems. The objectives of this work can thus be condensed into the following: extend the automated mathematical modelling framework of FEniCS such that
• discontinuous Galerkin methods can be handled;
• rapid prototyping of advanced models and applications is possible; and
• efficiency is maintained also for complex problems in general.
As will be demonstrated in this work, addressing the above three issues has had a significant impact on the range of problems which can be handled in the FEniCS framework, thereby making life easier for researchers and application developers. A complex application from solid mechanics in the form of a strain gradient plasticity model is considered, as an example, to demonstrate the extensions to the FEniCS framework developed in this work. Strain gradient models are often used to provide regularisation in softening problems and to account for observed size effects at small length scales. An abundance of strain gradient models have been proposed in the literature including the models by Aifantis (1984), Gurtin (2004), Fleck and Hutchinson (1997), Fleck and Hutchinson (2001) and Gao et al. (1999) to name a few. The focus in this work is on the class of models involving gradients of fields such as the equivalent plastic strain. An example of such a model is that proposed by Aifantis (1984) which involves the addition of the Laplacian of the equivalent plastic strain to the classical yield condition. A feature of this particular model is that the classical consistency condition leads to a partial differential equation rather than an algebraic equation, as is the case in classical flow theory of plasticity. The partial differential equation is only active in the region undergoing plastic deformations, which introduces the difficulty of imposing non-standard boundary conditions on the secondary field on the evolving boundary.
Motivated by the work of Wells et al. (2004) and Molari et al. (2006) who used a discontinuous Galerkin formulation for a strain gradient-dependent damage model, a discontinuous basis can be used to interpolate the secondary field. This provides a natural framework for handling evolving elastic–plastic boundaries and provides local (cell-wise) satisfaction of the yield condition. To satisfy the regularity requirement of the secondary field, a discontinuous Galerkin formulation is used to enforce weak continuity across cell facets. In order to allow the use of a discontinuous constant basis for the secondary field, a so-called lifting-type discontinuous Galerkin formulation, proposed by Bassi and Rebay (1997, 2002), is adopted. A discontinuous constant basis is the natural choice for the secondary field when a linear continuous basis is used for the displacement field. Considering that the formulation involves an additional field variable, it is also computationally more efficient if discontinuous constant elements can be used for this particular field.
1.2 Outline
The rest of this chapter contains an overview of the FEniCS Project including details on the components pertinent to the present work. Chapter 2 continues with a demonstration of how to use the FEniCS toolchain for solid mechanics applications. The purpose of this demonstration is twofold. Firstly, it serves as an introduction to the concepts of automated modelling from a solid mechanics point of view, which will give an understanding of how the automated modelling approach can be utilised to also tackle more complex problems. Secondly, the presented models and applications will be used in subsequent chapters, either by extending the models or by using them as a platform for discussing the development of FEniCS components in connection to the work presented in this thesis.

Local finite element tensors can be evaluated using different representations of the tensors. In Chapter 3 the two representations that FFC adopts, the quadrature representation and the tensor contraction representation, are presented and comparisons are made between the two representations. Furthermore, optimisation strategies for the quadrature representation are discussed and their performance is investigated.

Chapter 4 introduces the extensions implemented in the FEniCS framework to allow a class of discontinuous Galerkin (DG) formulations to be handled in an automated fashion. Building on these abstractions, a semi-automated approach to implementing lifting-type DG formulations is presented in Chapter 5. This chapter also contains a brief comparison, in terms of complexity regarding the implementation and the numerical implications, between a lifting-type formulation and an interior penalty (IP) DG formulation for the Poisson equation.

In Chapter 6 the extensions to the FEniCS framework developed in the previous chapters are brought together in an implementation of a lifting-type discontinuous formulation for a simple strain gradient plasticity model proposed by Aifantis (Aifantis, 1984). The purpose is to illustrate how researchers and application developers may create solvers for more complex problems on top of the FEniCS software. Finally, in Chapter 7, conclusions are drawn and recommendations for future development related to this work are presented.
1.3 The FEniCS Project
The FEniCS Project is a suite of open source programs for automating the solution of PDEs. This section presents the concepts and components which are most important in relation to this work and which will be elaborated on in subsequent chapters. Thus, only a subset of the components in the FEniCS Project is presented. Further details on the components presented here, and other components associated with the FEniCS Project, can be found in the FEniCS book (Logg et al., 2012a) or online at http://fenicsproject.org. All FEniCS software components, and the software developed in this work, can be obtained freely at https://bitbucket.org/fenics-project². The FEniCS Project is under continuous development; however, this presentation and all example code, and software developed and described in this work, are compliant with version 1.0 of the project and its associated components unless stated otherwise.

The majority of developments in this work are implemented in the core components of FEniCS. However, some of the developments are implemented in the FEniCS Solid Mechanics library³ (Ølgaard and Wells, 2013). In this thesis, several code snippets are presented along with many results from numerical experiments. All example code, and the code which has been used to obtain all the results, can be downloaded from https://bitbucket.org/k.b.oelgaard/oelgaard-thesis-supporting-material. Note that in order to run the code, working installations of FEniCS version 1.0 and FEniCS Solid Mechanics version 1.0 are required⁴.

² The FEniCS software components have recently moved from Launchpad (https://launchpad.net/fenics) to Bitbucket. However, as the FEniCS Project is being actively developed the location might change again in the future. The FEniCS website (http://fenicsproject.org), which is less likely to move, might be a better starting point for locating the software.

³ The FEniCS Solid Mechanics library was formerly known as FEniCS Plasticity (https://launchpad.net/fenics-plasticity) which focussed solely on plasticity problems. However, to reflect that the scope of the library has increased to also include more general solid mechanics problems the name was changed during a recent migration from Launchpad to Bitbucket.

⁴ Version 1.0 of the FEniCS Solid Mechanics library can be downloaded from https://bitbucket.org/fenics-apps/fenics-solid-mechanics.

The procedure of solving PDEs using the FEM can be broken down into the following four steps:

1. Formulate the variational problem of the PDE

2. Discretise the formulation

3. Finite element assembly

4. Solve the global system of equations

UFL → FFC → UFC → DOLFIN

Figure 1.1: FEniCS toolchain for solving a PDE using the FEM.
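The four steps above can be illustrated with a minimal, hand-written sketch in plain Python. It is not FEniCS code: it assumes the one-dimensional problem −u'' = f on (0, 1) with u(0) = u(1) = 0, a constant source f, piecewise linear elements on a uniform mesh, and a dense matrix with a naive solver. All function names are hypothetical and chosen for illustration only.

```python
# Steps 1 and 2 (done on paper): the weak form is
#   find u such that  int u' v' dx = int f v dx  for all test functions v,
# discretised with continuous piecewise linear functions on n equal cells.

def local_tensors(h, f):
    """Local element matrix and vector for one linear cell of length h."""
    A_T = [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]  # integral of u' v'
    b_T = [f * h / 2.0, f * h / 2.0]                  # integral of f v, f constant
    return A_T, b_T


def assemble(n, f):
    """Step 3: add the local contributions into the global system."""
    h = 1.0 / n
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    b = [0.0] * (n + 1)
    for cell in range(n):
        dofs = (cell, cell + 1)  # local-to-global mapping for this cell
        A_T, b_T = local_tensors(h, f)
        for i in range(2):
            b[dofs[i]] += b_T[i]
            for j in range(2):
                A[dofs[i]][dofs[j]] += A_T[i][j]
    return A, b


def apply_dirichlet(A, b, dof, value):
    """Replace one row by an identity row so that u[dof] = value."""
    A[dof] = [0.0] * len(b)
    A[dof][dof] = 1.0
    b[dof] = value


def solve(A, b):
    """Step 4: Gaussian elimination without pivoting, then back substitution."""
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            if m != 0.0:
                for j in range(k, n):
                    A[i][j] -= m * A[k][j]
                b[i] -= m * b[k]
    u = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (b[i] - s) / A[i][i]
    return u


def poisson_1d(n=8, f=1.0):
    """Assemble, apply u(0) = u(1) = 0, and solve; returns the nodal values."""
    A, b = assemble(n, f)
    apply_dirichlet(A, b, 0, 0.0)
    apply_dirichlet(A, b, n, 0.0)
    return solve(A, b)
```

For f = 1 the exact solution is u(x) = x(1 − x)/2, and the nodal values returned by `poisson_1d` can be compared against it. In a toolchain of the kind described here, the role of `local_tensors` is the part that a form compiler would generate, while assembly and solution are left to the library.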
Facilities for each of these steps are implemented in separate software components in FEniCS. The relationship between input and output of each component in the FEniCS toolchain for the finite element procedure is shown in Figure 1.1. In short, the variational form of the PDE is expressed in the Unified Form Language (UFL) (Alnæs et al., 2013; Alnæs, 2012), which is given as input to the FEniCS Form Compiler (FFC)⁵ (Kirby and Logg, 2006, 2007; Logg et al., 2012c; Ølgaard et al., 2008a) that automatically generates efficient C++ code for evaluating the local element tensors. The output from FFC is compliant with the interface defined in Unified Form-assembly Code (UFC) (Alnæs et al., 2009, 2012) and is used by DOLFIN (Logg and Wells, 2010; Logg et al., 2012d), which is the finite element assembler and solver of FEniCS although, in principle, any assembly library which supports UFC can be used.

The key advantage of this modular construction is that it becomes more transparent where and how new features and functionality should be implemented. Furthermore, developers and users can pick individual components to form their own applications. In this work for instance, the UFL is augmented with discontinuous Galerkin operators⁶, compiler optimisations are implemented in FFC, while more complex solvers for lifting-type formulations and solid mechanics problems can be implemented on top of the FEniCS toolbox.

⁵ Any compiler that supports UFL as input, and outputs UFC code, can be used instead of FFC in the described toolchain. The Symbolic Form Compiler (Alnæs and Mardal, 2010, 2012), which is also part of FEniCS, is one such example.

⁶ Historically, the DG operators were implemented in the original form language of FFC which was later merged into the richer UFL.

1.3.1 Simple model problem

As a model boundary value problem for presenting the FEniCS framework consider the Poisson equation, which for a body $\Omega \subset \mathbb{R}^d$, where $1 \le d \le 3$, with boundary $\partial\Omega$ and outward unit normal vector $n : \partial\Omega \to \mathbb{R}^d$ reads:

$$
\begin{aligned}
-\Delta u &= f && \text{in } \Omega, \\
u &= g && \text{on } \Gamma_D, \\
\nabla u \cdot n &= h && \text{on } \Gamma_N.
\end{aligned}
\tag{1.1}
$$

Here, $u$ is an unknown scalar field, $f$ is a source term, $g$ is a prescribed value for $u$ on the Dirichlet boundary $\Gamma_D$, and $h$ is a prescribed value for the outward normal derivative of $u$ on the Neumann boundary $\Gamma_N$. The boundaries $\Gamma_D$ and $\Gamma_N$ divide the boundary such that $\Gamma_D \subseteq \partial\Omega$ and $\Gamma_N = \partial\Omega \setminus \Gamma_D$. To apply the FEniCS framework the problem must be posed as a variational formulation in the following canonical form: find $u \in V$ such that

$$
a(u, v) = L(v) \quad \forall\, v \in \hat{V},
\tag{1.2}
$$

where $V$ is the trial space and $\hat{V}$ is the test space, and $a(u, v)$ and $L(v)$ denote the bilinear and linear forms, respectively. A typical variational form⁷ of (1.1) defines the bilinear and linear forms as:

$$
a(u, v) := \int_{\Omega} \nabla u \cdot \nabla v \, \mathrm{d}x,
\tag{1.3}
$$

$$
L(v) := \int_{\Omega} f v \, \mathrm{d}x + \int_{\Gamma_N} h v \, \mathrm{d}s,
\tag{1.4}
$$

with the trial and test spaces defined as:

$$
V := \left\{ v \in H^1(\Omega) : v = g \text{ on } \Gamma_D \right\},
\tag{1.5}
$$

$$
\hat{V} := \left\{ v \in H^1(\Omega) : v = 0 \text{ on } \Gamma_D \right\}.
\tag{1.6}
$$

The variational problem in (1.2) must be discretised to compute a finite element solution to the Poisson problem. This is done by using a pair of discrete function spaces for the test and trial functions: find $u_h \in V_h \subset V$ such that

$$
a(u_h, v) = L(v) \quad \forall\, v \in \hat{V}_h \subset \hat{V}.
\tag{1.7}
$$

Thus, after transforming the strong form of the problem into the variational counterpart, the FEniCS toolchain, starting with UFL, can be invoked to compute a solution.

⁷ Chapters 4 and 5 present discontinuous Galerkin formulations for (1.1).
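The step from the strong form (1.1) to the forms (1.3) and (1.4) is not spelled out in the text; a brief sketch, assuming sufficient smoothness of $u$ and $v$, is to multiply the first equation of (1.1) by a test function $v \in \hat{V}$, integrate over $\Omega$ and integrate by parts:

```latex
\begin{align*}
\int_{\Omega} f v \,\mathrm{d}x
  &= -\int_{\Omega} \Delta u \, v \,\mathrm{d}x \\
  &= \int_{\Omega} \nabla u \cdot \nabla v \,\mathrm{d}x
     - \int_{\partial\Omega} (\nabla u \cdot n)\, v \,\mathrm{d}s \\
  &= \int_{\Omega} \nabla u \cdot \nabla v \,\mathrm{d}x
     - \int_{\Gamma_N} h v \,\mathrm{d}s .
\end{align*}
```

The last line uses $v = 0$ on $\Gamma_D$ from (1.6) and $\nabla u \cdot n = h$ on $\Gamma_N$ from (1.1). Moving the boundary term to the left-hand side gives $a(u, v) = L(v)$ with $a$ and $L$ as in (1.3) and (1.4).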
UFL code

    element = FiniteElement("Lagrange", triangle, 1)

    u = TrialFunction(element)
    v = TestFunction(element)
    f = Coefficient(element)
    h = Coefficient(element)

    a = inner(grad(u), grad(v))*dx
    L = f*v*dx + h*v*ds
Figure 1.2: UFL code for the Poisson problem using continuous-piecewise linear Lagrange polynomials on triangles.
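To give a feel for the local computation that is automated for the bilinear form in Figure 1.2, the following hand-written sketch evaluates the 3 × 3 local element tensor of the form (1.3) on a single linear Lagrange triangle. It is plain Python written for illustration, not output of FFC, and the function name `p1_stiffness` is hypothetical.

```python
def p1_stiffness(vertices):
    """Local tensor A_T[i][j] = integral over T of grad(phi_i) . grad(phi_j)
    for the three linear Lagrange basis functions on the triangle T."""
    (x0, y0), (x1, y1), (x2, y2) = vertices
    # Determinant of the affine map from the reference triangle (twice the signed area).
    det_j = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    area = abs(det_j) / 2.0
    # The gradients of linear basis functions are constant on the cell.
    grads = [
        ((y1 - y2) / det_j, (x2 - x1) / det_j),
        ((y2 - y0) / det_j, (x0 - x2) / det_j),
        ((y0 - y1) / det_j, (x1 - x0) / det_j),
    ]
    # Constant integrand, so the integral is the area times the dot product.
    return [[area * (gi[0] * gj[0] + gi[1] * gj[1]) for gj in grads] for gi in grads]
```

On the reference triangle with vertices (0, 0), (1, 0) and (0, 1) the gradients are (−1, −1), (1, 0) and (0, 1), and each row of the resulting matrix sums to zero because the basis functions sum to one. For higher degrees or more involved forms the integrand is no longer constant, which is where writing such code by hand becomes tedious.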
1.3.2 Unified Form Language
In order to compute a solution to the variational problem using the FEM, it is necessary to discretise the formulation. The Unified Form Language (UFL) (Alnæs et al., 2013; Alnæs, 2012) enables a user to express the discretisation compactly using a notation which closely resembles the mathematical notation. UFL is implemented as a domain-specific embedded language (DSEL) in Python which, among other things, allows users to define custom operators using all features of the Python programming language when writing UFL code. This section presents the most basic features used throughout this work, while some of the more advanced functionality is presented in subsequent chapters as needed. For a detailed description of the language, refer to Alnæs et al. (2013).

The Poisson problem in (1.7) can be expressed in UFL by the code shown in Figure 1.2. The first line in the code defines the local finite element basis that spans the discrete function space $V_h$ on an element $T \in \mathcal{T}_h$, where $\mathcal{T}_h$ denotes the standard triangulation of $\Omega$. Generally, finite elements are defined in UFL by their family, cell and degree:
UFL code

    element = FiniteElement(family, cell, degree)

which in the given case, in Figure 1.2, means that the basis is a piecewise continuous linear Lagrange triangle. UFL contains a set of predefined finite element family names, for instance, "Lagrange" as already shown, "Discontinuous Lagrange" (short name "DG") and "Brezzi-Douglas-Marini" (short name "BDM"). The cell argument denotes the polygonal shape of the finite element while the degree argument denotes the degree of the polynomial space. Although the valid cell shapes in UFL are interval, triangle, tetrahedron, quadrilateral and hexahedron, FFC only supports the first three cell shapes at present. Also note that the permitted values of cell and degree depend on the choice of finite element family. It is important to realise that UFL is only concerned with the abstract operations related to the finite element function spaces; it is left to the form compiler to support the element families, that is, to generate meaningful code for the representation of elements and forms.

    Mathematical notation                  UFL notation         Mathematical notation          UFL notation
    $A \cdot B$                            dot(A, B)            $A_{,i}$                       A.dx(i)
    $A : B$                                inner(A, B)          $\partial A / \partial x_i$    Dx(A, i)
    $AB \equiv A \otimes B$                outer(A, B)          $\mathrm{d}A / \mathrm{d}n$    Dn(A)
    $A^T$                                  transpose(A), A.T    $\nabla A$                     grad(A)
    $\operatorname{sym} A$                 sym(A)               $\nabla \cdot A$               div(A)
    $\operatorname{tr} A \equiv A_{ii}$    tr(A)
    $\det A$                               det(A)

Table 1.1: (Left) Table of tensor algebraic operators. (Right) Table of differential operators.

For mixed finite element methods, product spaces like:
$$
V = [V_2 \times V_2] \times V_1
\tag{1.8}
$$

can easily be generated by either the MixedElement class or the * operator:
UFL code

    V_2 = FiniteElement("Lagrange", triangle, 2)
    V_1 = FiniteElement("Lagrange", triangle, 1)
    V = (V_2*V_2)*V_1
    W = MixedElement(MixedElement((V_2, V_2)), V_1)

meaning that V and W are identical. To create a mixed element in which all the component spaces are identical, the VectorElement can be used:
UFL code V= VectorElement(family, cell, degree, dim=None) where dim defaults to the dimension of the given cell unless explicitly specified. After defining the local finite element basis, the trial function u V , the ∈ h test function v V and the coefficient functions f , g V can be defined in ∈ h ∈ h a straightforward fashion as seen in the code in Figure 1.2. The bilinear and linear forms from (1.3) and (1.4) can then be implemented simply by using the tensor and differential operators defined in UFL, some of which can be seen in Table 1.1. An important thing to note is that the definition of the gradient operator grad(A) of, for instance, a vector valued function in UFL is grad(u) = ∂u /∂x { }ij i j 10 Chapter 1. Introduction
Mathematical notation UFL notation Mathematical notation UFL notation a a / b cos(f) b cos f b a a**b, pow(a,b) sin f sin(f) p f sqrt(f) tan f tan(f) exp f exp(f) arccos f acos(f) ln f ln(f) arcsin f asin(f) f abs(f) arctan f atan(f) | | sign f sign(f)
Table 1.2: (Left) Table of elementary functions. (Right) Table of trigonometric functions. and not grad(u) = ∂u /∂x . The latter operator is, however, provided in { }ij j i UFL by nabla_grad(A). A similar convention applies to the divergence operator where nabla_div(A) is provided as an alternative to div(A). In this work, the operators u and u follow the UFL definition for the gradient and divergence ∇ ∇ · operators, grad(u) and div(u), respectively and should not be confused with the UFL operators nabla_grad(u) and nabla_div(u). To complete the implementation of the variational forms, integration on the R relevant domains must be expressed. In UFL, the integral over the domain I dx Ωk is denoted by I*dx(k) while the integral over the exterior boundary of the domain R I ds is denoted by I*ds(k) where k is the subdomain number and I is a valid ∂Ωk UFL expression. Thus, having completed the implementation of (1.2) in the near- mathematical notation of UFL, the form compiler can be invoked to generate code from the abstract UFL representation. The last two classes of expressions to be presented in this short introduction to UFL are nonlinear scalar functions and geometric quantities. UFL provides a set of nonlinear scalar functions, presented in Table 1.2, which can be applied to, for instance, scalar valued coefficient functions such as f and g in the Poisson example. It is illegal to apply these functions to any test or trial function as this would render the variational form nonlinear in those arguments. Geometric quantities are related to the local finite element cell T. For instance, the coordinate of the integration point currently being evaluated on T (including its boundary) can be accessed via cell.x. Other geometric quantities which are particularly useful in relation to this work are the outward normal to the facet8 currently being evaluated cell.n and the circumradius, the radius of the circumscribed circle of T, cell.circumradius. 
Basic usage of nonlinear functions and the integration point coordinate is demonstrated later in Section 1.3.5, while the facet normal and circumradius are frequently used when defining discontinuous Galerkin variational forms in Chapter 4.

⁸ A facet is a topological entity of a computational mesh of dimension D − 1 (codimension 1), where D is the topological dimension of the cells of the computational mesh. Thus, for a triangular mesh, the facets are the edges and for a tetrahedral mesh, the facets are the faces.
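The index convention that separates grad from nabla_grad can be checked numerically. The sketch below is plain Python rather than UFL: it approximates $\partial u_i / \partial x_j$ by central differences for a hypothetical vector field and shows that the second convention is simply the transpose of the first. The function names and the test field are assumptions made for illustration only.

```python
def jacobian(u, x, h=1e-6):
    """Central-difference approximation of G[i][j] = d u_i / d x_j at the point x."""
    m, n = len(u(x)), len(x)
    G = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        up, um = u(xp), u(xm)
        for i in range(m):
            G[i][j] = (up[i] - um[i]) / (2.0 * h)
    return G

def transpose(G):
    """Swap the two indices of a rank two tensor stored as nested lists."""
    return [list(row) for row in zip(*G)]

def u(x):
    """Hypothetical vector field u = (x0*x1, x1^2) with a non-symmetric gradient."""
    return (x[0] * x[1], x[1] ** 2)

point = (1.0, 3.0)
grad_u = jacobian(u, point)           # entry (i, j) holds d u_i / d x_j
nabla_grad_u = transpose(grad_u)      # entry (i, j) holds d u_j / d x_i
```

At the chosen point the off-diagonal entries differ (1 and 0), so the two conventions give visibly different tensors unless the gradient happens to be symmetric.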
1.3.3 FEniCS Form Compiler

As shown in Figure 1.1, the FEniCS Form Compiler (FFC) (Kirby and Logg, 2006, 2007; Logg et al., 2012c; Ølgaard et al., 2008a; Ølgaard and Wells, 2010) takes as input a variational form specified in UFL and generates as output C++ code which conforms to the UFC interface, to be described in Section 1.3.4. Central to the finite element method is the assembly of sparse tensors, described in Section 1.3.5, which relies on the computation of the local element tensor $A_T$ as well as the local-to-global mapping $\iota_T$. Although it is possible to hand code $A_T$ and $\iota_T$, the process is both tedious and error-prone, especially for complex problems. This issue is eliminated by letting FFC generate the code automatically. Introducing a compiler also provides the possibility of applying various optimisation strategies for efficient computation of $A_T$, which would normally not be feasible when developing code by hand. The automated code generation for the general and efficient solution of finite element variational problems is one of the key features of FEniCS.

There are three different interfaces to FFC: a Python interface, a just-in-time (JIT) compilation interface and a command-line interface. Only the latter is presented here while details on the other two interfaces can be found in Logg et al. (2012c). The command-line interface takes a UFL form file or a list of form files as input:
Bash code $ ffc Poisson.ufl
The form file contains the UFL specification of elements and/or forms, for instance the code from Figure 1.2, which in this case is saved in the file Poisson.ufl. The content of a form file is wrapped in a Python script and then executed for further processing in FFC. There exist a number of optional command-line options to control the code generation. Related to this work, the most important options are:

-l language
    This parameter controls the output format for the generated code. The default value is “ufc”, which indicates that the code is generated according to the UFC specification. Alternatively, the value “dolfin” may be used to generate code according to the UFC format with a small set of additional DOLFIN-specific wrappers.

-r representation
    This parameter controls the representation used for the generated element tensor code. There are three possibilities: “auto” (the default), “quadrature” and “tensor”. FFC implements two different approaches to code generation. One is based on traditional quadrature and another on a special tensor representation. This will be discussed in Section 3.2. In the case “auto”, FFC will try to select the better of the two representations; that is, the representation that is believed to yield the best run-time performance for the problem at hand. This issue is addressed in detail in Section 3.6.

-O
    If this option is used, the code generated for the element tensor is optimised for run-time performance. The optimisation strategy used depends on the chosen representation. In general, this will increase the time required for FFC to generate code, but should reduce the run-time for the generated code. Note that for very complicated variational forms, hardware limitations can make compilation with some optimisation options impossible. Optimisation strategies are treated in Chapter 3.

As an illustration of the options presented above, the command:
Bash code
$ ffc -l dolfin -r quadrature -O Poisson.ufl

will cause FFC to generate code for the Poisson problem, including DOLFIN wrappers, using the quadrature representation with the default optimisation. A list of all available command-line parameters can be seen in the FFC manual page by typing ‘man ffc’ on the command-line.

FFC follows the conventional design of a compiler in that it breaks compilation into several sequential stages. The output generated at each stage serves as input for the following stage, as illustrated in Figure 1.3. Introducing separate stages allows development and improvement of each stage to be implemented without affecting other stages of the compilation. Furthermore, adding new stages and dropping existing stages becomes trivial. Each of the stages involved when compiling a form is described in the following. Compilation of elements follows a similar (but simpler) set of stages, and is not described here.

Compiler stage 0: Language (parsing). In this stage, the user-specified form is interpreted and stored as a UFL abstract syntax tree (AST). The actual parsing is handled by Python and the transformation to a UFL form object is implemented by operator overloading in UFL.
    Input: Python code or .ufl file
    Output: UFL form

Compiler stage 1: Analysis. This stage preprocesses the UFL form and extracts form metadata (FormData), such as which elements were used to define the form, the number of coefficients and the cell type (interval, triangle or tetrahedron). This stage also involves selecting a suitable quadrature scheme and representation (as discussed earlier) for the form if these have not been specified by the user.
    Input: UFL form
    Output: preprocessed UFL form and form metadata
Figure 1.3: Compilation of finite element variational forms broken into six sequential stages: Language, Analysis, Representation, Optimisation, Code generation and Code Formatting. Each stage generates output based on input from the previous stage. The input/output data consist of a UFL form file, a UFL object, a UFL object and metadata computed from the UFL object, an intermediate representation (IR), an optimised intermediate representation (OIR), C++ code and, finally, C++ code files (from Logg et al. (2012c)).

Foo.ufl → Stage 0: Language → UFL → Stage 1: Analysis → UFL + metadata → Stage 2: Representation → IR → Stage 3: Optimization → OIR → Stage 4: Code generation → C++ code → Stage 5: Code formatting → Foo.h / Foo.cpp
Compiler stage 2: Code representation. Most of the complexity of compilation is handled in this stage, which examines the input and generates all data needed for the code generation. This includes generation of finite element basis functions, extraction of data for mapping of degrees of freedom, and generation of the form representation (see Section 3.2), which may involve precomputation of integrals. Both representations available in FFC use tabulated values of finite element basis functions and their derivatives at a suitable set of integration points on the reference element. FFC itself does not generate these values, but relies on the library FIAT (Kirby, 2004, 2012) for the computation of basis functions and their derivatives.
The intermediate representation is stored as a Python dictionary, mapping names of UFC functions to the data needed for generation of the corresponding code. In simple cases, like ufc::form::rank, this data may be a simple number like 2. In other cases, like ufc::cell_integral::tabulate_tensor, the data may be a complex data structure that depends on the choice of form representation.
    Input: preprocessed UFL form and form metadata
    Output: intermediate representation (IR)

Compiler stage 3: Optimisation. This stage examines the intermediate representation and performs optimisations. The optimisation strategy depends on the chosen form representation; see Section 3.3 for optimisations pertinent to the quadrature representation. Data stored in the intermediate representation dictionary is then replaced by new data that encode an optimised version of the function in question.
    Input: intermediate representation (IR)
    Output: optimised intermediate representation (OIR)

Compiler stage 4: Code generation. This stage examines the optimised intermediate representation and generates the actual C++ code for the body of each UFC function. The code is stored as a dictionary, mapping names of UFC functions to strings containing the C++ code.
As an example, the data generated for ufc::form::rank may be the string “return 2;”. This demonstrates the importance of separating stages 2, 3 and 4 as it allows stages 2 and 3 to focus on algorithmic aspects related to finite elements and variational forms, while stage 4 is concerned only with generating C++ code from a set of instructions prepared in earlier compilation stages.
    Input: optimised intermediate representation (OIR)
    Output: C++ code

Compiler stage 5: Code formatting. This stage examines the generated C++ code and formats it according to the UFC format, generating as output one or more .h/.cpp files conforming to the UFC specification. This is where the actual writing of C++ code takes place and the stage relies on templates for UFC code available as part of the UFC module ufc_utils.
    Input: C++ code
    Output: C++ code files

The interface to the code which is generated by FFC is discussed in the following section.

ufc::mesh    ufc::function          ufc::cell_integral
ufc::cell    ufc::finite_element    ufc::exterior_facet_integral
ufc::form    ufc::dofmap            ufc::interior_facet_integral

Table 1.3: C++ classes defined in the UFC interface.
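The separation between the representation, optimisation, code generation and formatting stages can be mimicked with a toy pipeline. The sketch below is not FFC code: the stage functions, the dictionary keys and the output template are hypothetical, and they only illustrate the idea of passing dictionaries keyed by function names from one stage to the next.

```python
def representation(form_metadata):
    """Stage 2 (toy): map function names to the data needed to generate them."""
    return {"rank": form_metadata["rank"],
            "num_coefficients": form_metadata["num_coefficients"]}

def optimisation(ir):
    """Stage 3 (toy): return a new dictionary; plain numbers need no optimisation."""
    return dict(ir)

def code_generation(oir):
    """Stage 4 (toy): map function names to strings holding the C++ function body."""
    return {name: "return %d;" % value for name, value in oir.items()}

def code_formatting(code, prefix):
    """Stage 5 (toy): wrap each body in a function definition and join the results."""
    template = "unsigned int %s_%s() const\n{\n  %s\n}\n"
    return "\n".join(template % (prefix, name, body)
                     for name, body in sorted(code.items()))

# Stand-in for the output of the analysis stage for a bilinear form
metadata = {"rank": 2, "num_coefficients": 0}

ir = representation(metadata)
oir = optimisation(ir)
code = code_generation(oir)
header = code_formatting(code, "poisson")
```

Because each stage only consumes the dictionary produced by the previous one, a stage can be replaced (for instance by a more aggressive optimisation) without touching the others, which is the design argument made above.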
1.3.4 Unified Form-assembly Code

The purpose of Unified Form-assembly Code (UFC) (Alnæs et al., 2009, 2012) is to provide an interface between the problem-specific code generated by form compilers and general-purpose problem solving environments like DOLFIN (described in Section 1.3.5) which implements, among other things, the finite element assembly algorithm. In contrast to other FEniCS components, few changes are made to UFC in order to maintain a stable interface between form compilers and DOLFIN. This section gives a brief introduction to the interface, with emphasis on the functions relevant for this work. Furthermore, the UFC numbering convention for mesh entities is discussed.

The UFC interface provides a small set of abstract C++ classes, shown in Table 1.3, which are commonly used for assembling finite element tensors. The mesh and cell classes are simple data structures that provide information such as the geometric dimension and the topological dimension. In addition, the cell class provides an array of global indices for the mesh entities belonging to the given cell (cell.mesh_entities) and an array of coordinates of the vertices of the cell (cell.coordinates). The classes function and finite_element define interfaces for general tensor-valued functions and finite elements respectively.

The form class defines an interface for assembly of the global tensor corresponding to the given form. This includes functions to create finite_element, dofmap and integral objects (ufc::cell_integral, ufc::exterior_facet_integral and ufc::interior_facet_integral) of the variational form.

Of particular interest in relation to this work are the dofmap and integral classes. The local-to-global degree of freedom mapping on the finite element cell T, $\iota_T$, is computed by the dofmap::tabulate_dofs function for which the UFC interface is defined as:
C++ code
/// Tabulate the local-to-global mapping of dofs on a cell
virtual void tabulate_dofs(unsigned int* dofs,
                           const mesh& m,
                           const cell& c) const = 0;

where dofs is a pointer to an array for the tabulated values on T. UFC only provides the interface of this function; it is not concerned with computing $\iota_T$. The code to compute $\iota_T$ must be generated by the form compiler. For example, FFC will generate the following code for linear Lagrange elements on triangles.
C++ code
/// Tabulate the local-to-global mapping of dofs on a cell
virtual void tabulate_dofs(unsigned int* dofs,
                           const ufc::mesh& m,
                           const ufc::cell& c) const
{
  dofs[0] = c.entity_indices[0][0];
  dofs[1] = c.entity_indices[0][1];
  dofs[2] = c.entity_indices[0][2];
}
Note that FFC associates each degree of freedom with the global vertex number which can be extracted from the cell::entity_indices array. For discontinuous linear Lagrange elements on triangles the generated code is
C++ code
/// Tabulate the local-to-global mapping of dofs on a cell
virtual void tabulate_dofs(unsigned int* dofs,
                           const ufc::mesh& m,
                           const ufc::cell& c) const
{
  dofs[0] = 3*c.entity_indices[2][0];
  dofs[1] = 3*c.entity_indices[2][0] + 1;
  dofs[2] = 3*c.entity_indices[2][0] + 2;
}

because FFC considers all degrees of freedom local to the given element and therefore computes degree of freedom numbers based on the global cell index.

The local finite element tensor is computed inside the tabulate_tensor function which is implemented by all three integral classes, although the interface varies slightly. For the cell_integral, the interface is
C++ code
/// Tabulate the tensor for the contribution from a local cell
virtual void tabulate_tensor(double* A,
                             const double * const * w,
                             const cell& c) const = 0;

where A is a pointer to an array which will hold the values of the local element tensor and w contains nodal values of any coefficient functions present in the integral. The code which FFC generates for this function varies depending on, for example, the choice of representation and optimisation, issues which are discussed in Chapter 3. (Figures 3.2 and 3.3, on pages 63 and 65 respectively, show examples of code generated by FFC for this function.) The interface for exterior_facet_integral::tabulate_tensor is similar in nature to the interface for interior_facet_integral::tabulate_tensor which is discussed in Section 4.1.2 in connection to automation of discontinuous Galerkin methods.

Vertex   Coordinates
v0       x = (0, 0)
v1       x = (1, 0)
v2       x = (0, 1)

Figure 1.4: The UFC reference triangle and the coordinates of the vertices.

The UFC specification also defines a numbering scheme for mesh entities which allows form compilers to access necessary data consistently when generating code, for example, for computing the local tensors and local-to-global mapping as discussed above. Important aspects of this numbering scheme are summarised in the following for triangular cells. Further details on the UFC numbering convention can be found in Alnæs et al. (2012).

The UFC reference triangle, including the coordinates of the three vertices, is shown in Figure 1.4. Mesh entities are identified by the tuple (d, i) where d is the topological dimension of the mesh entity and i is a unique global index of the mesh entity. For convenience, mesh entities of topological dimension 0 are referred to as vertices, entities of dimension 1 are referred to as edges and entities of dimension 2 are referred to as faces. Mesh entities of topological dimension D − 1 (codimension 1), with D denoting the topological dimension of the cells of the computational mesh, are referred to as facets. Thus for a triangular mesh, the facets are the edges and for a tetrahedral mesh, the facets are the faces. Following this convention, the vertices of a triangle are identified as v0 = (0, 0), v1 = (0, 1) and v2 = (0, 2), the edges (facets) are e0 = (1, 0), e1 = (1, 1) and e2 = (1, 2), and the cell itself is c0 = (2, 0).
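The (d, i) identification is what the two generated tabulate_dofs variants shown earlier in this section index into. As a minimal sketch, the following Python functions mirror those two C++ listings for a hypothetical mesh of two triangles; the function names and the dictionary standing in for the entity_indices array are assumptions made for this illustration.

```python
def tabulate_dofs_p1(entity_indices):
    """Continuous linear Lagrange: one dof per vertex, numbered by global vertex index."""
    return [entity_indices[0][0], entity_indices[0][1], entity_indices[0][2]]

def tabulate_dofs_dg1(entity_indices):
    """Discontinuous linear Lagrange: three dofs per cell, numbered by global cell index."""
    cell_index = entity_indices[2][0]
    return [3 * cell_index, 3 * cell_index + 1, 3 * cell_index + 2]

# Two triangles sharing the edge between global vertices 1 and 2.
# entity_indices[d][i] is the global index of local entity i of dimension d;
# edge indices (d = 1) are not needed here and are left empty.
cells = [
    {0: [0, 1, 2], 1: [], 2: [0]},
    {0: [1, 2, 3], 1: [], 2: [1]},
]

p1_dofs = [tabulate_dofs_p1(c) for c in cells]
dg1_dofs = [tabulate_dofs_dg1(c) for c in cells]

# Dofs that the two neighbouring cells have in common
shared_p1 = set(p1_dofs[0]) & set(p1_dofs[1])
shared_dg1 = set(dg1_dofs[0]) & set(dg1_dofs[1])
```

In the continuous case the two cells share the dofs on their common vertices, which is what couples them in the assembled system; in the discontinuous case no dof is shared, so any coupling has to come from facet integrals.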
The vertices of simplicial cells (intervals, triangles and tetrahedra) are numbered locally based on the corresponding global vertex numbers such that a tuple of increasing local vertex numbers corresponds to a tuple of increasing global vertex numbers. This is illustrated for a simple mesh in Figure 1.5. The remaining mesh entities are numbered within each topological dimension based on a lexicographical ordering of ordered tuples of non-incident vertices. For example, the first edge, e0, of a triangle is located opposite vertex v0 as shown in Figure 1.6a. The numbering of mesh entities on triangular cells is shown in Table 1.4.

Figure 1.5: Local vertex numbering of simplicial mesh based on global vertex numbers.

Entity        Incident vertices   Non-incident vertices
v0 = (0, 0)   (v0)                (v1, v2)
v1 = (0, 1)   (v1)                (v0, v2)
v2 = (0, 2)   (v2)                (v0, v1)
e0 = (1, 0)   (v1, v2)            (v0)
e1 = (1, 1)   (v0, v2)            (v1)
e2 = (1, 2)   (v0, v1)            (v2)
c0 = (2, 0)   (v0, v1, v2)        ∅

Table 1.4: Local numbering of mesh entities on triangular cells.

The relative ordering of mesh entities with respect to other incident mesh entities follows by sorting the entities by their indices. Therefore, the pair of vertices incident to edge e0 in Figure 1.6a is (v1, v2), not (v2, v1). Due to the vertex numbering convention, this means that two incident simplicial cells will always agree on the orientation of incident subsimplices (for instance facets). This is demonstrated in Figure 1.6b, which shows two incident triangles which agree on the orientation of the common edge. This feature is advantageous when generating code for discontinuous Galerkin methods, as will be demonstrated in Chapter 4.
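The edge numbering rule and its consequence for orientation can be reproduced in a few lines. The sketch below is a plain Python illustration for triangles, not UFC code; the helper names and the two-triangle mesh (global vertices 0 to 3, common edge between vertices 1 and 2) are assumptions chosen for the example.

```python
from itertools import combinations

def local_edges():
    """Edges of a triangle, numbered by lexicographically ordered non-incident vertices.

    Entry k is the sorted tuple of local vertices incident to edge k."""
    vertices = range(3)
    edges = []
    for non_incident in combinations(vertices, 1):
        incident = tuple(v for v in vertices if v not in non_incident)
        edges.append(incident)
    return edges

def global_edges(cell_vertices):
    """Global vertex pairs of the local edges of a cell.

    Local vertices are numbered by sorting the global vertex numbers."""
    local_to_global = sorted(cell_vertices)
    return [tuple(local_to_global[v] for v in edge) for edge in local_edges()]

left = global_edges([0, 1, 2])
right = global_edges([3, 1, 2])   # input order is irrelevant; vertices are sorted

# The edge seen from both cells, expressed as an ordered pair of global vertices
common = set(left) & set(right)
```

Both cells describe the common edge by the same ordered pair of global vertices, even though it is local edge e0 in the first cell and local edge e2 in the second, so they agree on its orientation.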
1.3.5 DOLFIN

Up until now, only the variational form and finite element discretisation have been defined. To obtain a solution to the boundary value problem in (1.1) the computational domain and boundary conditions must be specified, which in the context of FEniCS is handled via a component called DOLFIN, a C++/Python library, which also provides algorithms for finite element assembly and linear algebra functionality to solve the arising system of equations. DOLFIN provides a problem solving environment and is the main user interface to FEniCS. A detailed presentation of DOLFIN is outside the scope of this work but can be found in Logg and Wells (2010) and Logg et al. (2012d).

Figure 1.6: Edge numbering and orientation based on sorted tuples of incident and non-incident vertices. As a consequence two incident triangles will always agree on the orientation of the common facet for simplicial cells.
(a) Edges are numbered based on the non-incident vertex. Therefore, e0 is located opposite vertex v0.
(b) Orientation of facets (edges) is defined by the ordered tuple of incident vertices thus e0 = (v0, v2) and e2 = (v0, v1).

The necessary DOLFIN functionality to implement a complete solver for the Poisson problem is presented in the following. The intention is to give an impression of the possibilities that are offered by DOLFIN and an understanding of the basic concepts that are developed and used in subsequent chapters. For the model problem under consideration the domain $\Omega = [0, 1] \times [0, 1]$, in which the source term $f = 8\pi^2 \sin(2\pi x) \sin(2\pi y)$ is present, is subjected to homogeneous Dirichlet boundary conditions, $g = 0$ on $\Gamma_D = \partial\Omega$.

A complete C++ solver for this problem is shown in Figures 1.7 and 1.8. The first line in Figure 1.7 includes the DOLFIN library, while the second line includes the UFC conforming code generated by FFC based on the UFL input for the Poisson problem shown in Figure 1.2. Then follows the definition of the class Source which is a subclass of the Expression class. An Expression represents a function that can be evaluated on a finite element space and to suit this purpose it implements an eval function. This function takes as arguments an array of values which holds the return values and an array x which contains the coordinates of the point where the Expression is currently being evaluated. The Source class overloads the eval function, which in this case simply inserts the value $8\pi^2 \sin(2\pi x) \sin(2\pi y)$ into the values array.

C++ code
#include

using namespace dolfin;

// Source term
class Source: public Expression { void eval(Array

// Sub domain for Dirichlet boundary condition
class DirichletBoundary: public SubDomain { bool inside(const Array

Figure 1.7: Implementation of source term and Dirichlet boundary for the C++ solver for the boundary value problem in (1.1). Program continues in Figure 1.8.

Next follows the definition of the class DirichletBoundary, a subclass of SubDomain, for the part of the boundary where Dirichlet boundary conditions are to be applied. The SubDomain class implements the function inside which evaluates to true or false depending on whether or not the point given by coordinates x is part of the subdomain. In addition to the argument x, the inside function also takes the argument on_boundary, a boolean value, supplied by DOLFIN, which is true if the point x is located on ∂Ω. In the given case, the Dirichlet condition is indeed applied on ∂Ω which means that the overloaded inside function can simply be implemented by returning the on_boundary argument.

The remaining part of the C++ solver, the main function, is shown in Figure 1.8. The first line defines the computational mesh, which consists of 2048 triangles as the unit square is divided into 32 × 32 cells and each cell is divided into two triangles. DOLFIN provides functionality for creating simple meshes through the classes: UnitInterval, UnitSquare, UnitCube, UnitCircle, UnitSphere, Interval, Rectangle and Box which are useful for testing. For ‘real’ applications, a user can read a mesh from file in the following way:
C++ code
Mesh mesh("mesh.xml");

provided that the mesh is saved in the DOLFIN XML format. Meshes can be generated by external libraries, such as Gmsh (http://geuz.org/gmsh/), stored in the Gmsh data format and converted by the dolfin-convert script to the DOLFIN XML data format.

Next, the FunctionSpace is defined for the finite element function space $V_h$ in (1.7). A function space is represented by a Mesh, a DofMap and a FiniteElement. The DofMap and FiniteElement classes are generated by FFC based on the element definition in Figure 1.2. However, by including the ‘-l dolfin’ option when compiling the UFL input with FFC:
Bash code
$ ffc -l dolfin Poisson.ufl

the DOLFIN wrappers are generated, permitting a user to instantiate a function space simply by providing the mesh as argument to the constructor. The next three lines define an object for the Dirichlet boundary condition $u = g = 0$ on the boundary $\Gamma_D$ defined by the DirichletBoundary class from Figure 1.7. The value g = 0 is simply represented as a constant.

Then follows the creation of the bilinear and linear forms of the Poisson problem using the function space V as argument. The Poisson::BilinearForm and Poisson::LinearForm classes are part of the code in Poisson.h generated by FFC from the UFL input in Figure 1.2. Note how the coefficients f and h are defined and attached to the linear form. The coefficient f is defined by the class Source as shown in Figure 1.7, while the coefficient h, the Neumann boundary condition, is zero in the given case.

C++ code
int main()
{
  // Create mesh and function space
  UnitSquare mesh(32, 32);
  Poisson::FunctionSpace V(mesh);

  // Define boundary condition
  Constant g(0.0);
  DirichletBoundary boundary;
  DirichletBC bc(V, g, boundary);

  // Define variational forms
  Poisson::BilinearForm a(V, V);
  Poisson::LinearForm L(V);
  Source f;
  L.f = f;
  Constant h(0.0);
  L.h = h;

  // Compute solution
  Function u(V);
  Matrix A;
  Vector b;
  assemble(A, a);
  assemble(b, L);
  bc.apply(A, b);
  solve(A, *u.vector(), b);

  // Save solution in PVD format
  File file("poisson.pvd");
  file << u;

  // Plot solution
  plot(u);

  return 0;
}

Figure 1.8: Continuation from Figure 1.7 of C++ code for the Poisson boundary value problem.

A Function u is then declared to hold the computed solution. The Function class represents a finite element function in V and therefore takes a function space as argument. The function u also holds a vector of values of the degrees of freedom associated with the function. A function is evaluated based on linear combinations of basis functions and the values of this vector. This is in contrast to the Expression class which is evaluated by overloading the eval function as seen in Figure 1.7.

To compute a solution for u which satisfies the variational problem, defined by the bilinear and linear forms a and L, the following three steps are applied. Firstly, the bilinear and linear forms a and L are assembled into the Matrix A and the Vector b by calling the free function assemble which implements an algorithm to assemble finite element variational forms. The assembly algorithm will be presented later in this section. Secondly, the Dirichlet boundary condition is applied to the linear system of equations using the apply member function of the DirichletBC object bc. Thirdly, after applying the boundary condition, the system of equations can be solved by calling the free function solve which solves linear systems of the form Ax = b using the assembled matrix A, the vector of degree of freedom values from u and the assembled vector b as arguments.

As an alternative to the three steps outlined above, the solve function provides functionality to solve variational problems in a straightforward fashion, namely by:
C++ code
solve(a == L, u, bc);

which automatically assembles the system, applies the boundary conditions and solves the linear system, storing the solution in the function u.

Finally, the solution is saved in ParaView Data (PVD) format (http://www.paraview.org/) for external post processing and plotted by the built-in plot command of DOLFIN which enables a quick visual inspection of the computed solution. The computed solution to the Poisson boundary value problem is shown in Figure 1.9.
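The second and third of the three steps (apply the boundary condition, then solve) can be illustrated on a very small system. The sketch below is plain Python and does not call DOLFIN. It starts from a hand-assembled system for the one-dimensional analogue −u'' = 1 on (0, 1) with four linear elements, imposes u = 0 at both ends by replacing the constrained rows with identity rows, and solves by Gaussian elimination. Replacing rows by identity rows is one common way of imposing Dirichlet conditions; the text does not state how bc.apply is implemented, so this choice, like the helper names, is an assumption.

```python
def apply_dirichlet(A, b, dofs, value):
    """Replace each constrained row by an identity row and set the right-hand side."""
    for i in dofs:
        A[i] = [0.0] * len(b)
        A[i][i] = 1.0
        b[i] = value

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting; A and b are left unchanged."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            factor = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= factor * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x

# Step 1 (done by hand here): stiffness matrix and load vector for cell size h = 0.25
A = [[ 4.0, -4.0,  0.0,  0.0,  0.0],
     [-4.0,  8.0, -4.0,  0.0,  0.0],
     [ 0.0, -4.0,  8.0, -4.0,  0.0],
     [ 0.0,  0.0, -4.0,  8.0, -4.0],
     [ 0.0,  0.0,  0.0, -4.0,  4.0]]
b = [0.125, 0.25, 0.25, 0.25, 0.125]

# Step 2: impose u = 0 on the two boundary dofs
apply_dirichlet(A, b, [0, 4], 0.0)

# Step 3: solve the linear system for the dof values
u = solve_dense(A, b)
```

For this particular problem the nodal values coincide with the exact solution x(1 − x)/2 evaluated at the mesh vertices, which gives a simple check on the three steps.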
Python interface
As already mentioned, DOLFIN also provides a Python interface as an alternative to the C++ interface. Most of the Python interface is generated automatically from the C++ interface using SWIG (http://www.swig.org/). In addition, the Python interface offers seamless integration with UFL and FFC through just-in-time compilation of variational forms and elements which, in combination with the expressiveness of Python, allows solvers to be implemented very compactly. For this reason, the Python interface to DOLFIN is preferred, whenever feasible, for the examples presented in this thesis.

Figure 1.9: Computed solution to the Poisson boundary value problem. The warped scalar field u in the figure on the right has been scaled by a factor of 0.5.

As an example, the complete solver for the Poisson boundary value problem using the Python interface is shown in Figure 1.10. The code is very similar to the C++ code in Figures 1.7 and 1.8 and the differences are mainly due to the difference in Python and C++ syntax. The two main differences are the definition of the FunctionSpace and the definition of the variational forms which are implemented directly as part of the solver and not in a separate file. Also note that the UFL coordinates have been used to implement the source term f directly as part of the variational formulation. It could also be implemented by subclassing the Expression class and overloading the eval function in a similar way to the approach in the C++ example:
Python code
class Source(Expression):
    def eval(self, values, x):
        values[0] = 8*pi**2*sin(2*pi*x[0])*sin(2*pi*x[1])
f = Source()
As an alternative, it could be implemented by:
Python code
f = Expression("8*pow(pi,2)*sin(2*pi*x[0])*sin(2*pi*x[1])")

where the string argument to the Expression class is given in C++ syntax which is automatically just-in-time compiled in order to evaluate the Expression. Compared to the subclassing approach, this is more efficient as the callback to the eval function will take place in C++ rather than Python.

Python code
from dolfin import *

# Create mesh and define function space
mesh = UnitSquare(32, 32)
V = FunctionSpace(mesh, "Lagrange", 1)

# Define Dirichlet boundary
def boundary(x, on_boundary):
    return on_boundary

# Define boundary condition
g = Constant(0.0)
bc = DirichletBC(V, g, boundary)

# Define variational problem
u = TrialFunction(V)
v = TestFunction(V)
x = V.cell().x
f = 8*pi**2*sin(2*pi*x[0])*sin(2*pi*x[1])
a = inner(grad(u), grad(v))*dx
L = f*v*dx

# Compute solution
U = Function(V)
solve(a == L, U, bc)

# Save solution in PVD format
file = File("poisson.pvd")
file << U

# Plot solution
plot(U, interactive=True)

Figure 1.10: Complete Python solver for the boundary value problem in (1.1).
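The source term used in both solvers, $f = 8\pi^2 \sin(2\pi x)\sin(2\pi y)$, can be sanity-checked without DOLFIN. The following sketch is an observation added for illustration and is not taken from the thesis: it uses a five-point finite difference stencil to check that $u = \sin(2\pi x)\sin(2\pi y)$, which vanishes on the boundary of the unit square, satisfies $-\Delta u = f$ at a sample point. The function names are hypothetical.

```python
import math

def u_candidate(x, y):
    """Candidate solution that is zero on the boundary of the unit square."""
    return math.sin(2.0 * math.pi * x) * math.sin(2.0 * math.pi * y)

def source(x, y):
    """Source term of the model problem."""
    return 8.0 * math.pi ** 2 * math.sin(2.0 * math.pi * x) * math.sin(2.0 * math.pi * y)

def minus_laplacian(u, x, y, h=1e-4):
    """Five-point finite difference approximation of -(u_xx + u_yy)."""
    return -(u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h)
             - 4.0 * u(x, y)) / h ** 2

# Compare the two sides of the equation at an interior sample point
x, y = 0.3, 0.2
residual = minus_laplacian(u_candidate, x, y) - source(x, y)

# Boundary values of the candidate on two sides of the unit square
boundary_values = [u_candidate(0.0, y), u_candidate(1.0, y)]
```

The residual is small compared with the size of f at that point, which suggests that the computed field in Figure 1.9 can be compared against this closed-form expression.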
Assembly algorithm
To conclude this short introduction to the FEniCS Project, the assembly algorithm, implemented in the DOLFIN assemble function, is presented. The presentation is given for the assembly of the rank two tensor corresponding to the bilinear form of the Poisson problem in (1.2). A generalisation of the algorithm for multilinear forms is given in Alnæs et al. (2009) and Logg et al. (2012b). Setting the function space $\hat{V} = V$, the tensor A which arises from assembling the bilinear form a is defined by

$A_I = a(\phi_{I_2}, \phi_{I_1})$,    (1.9)

where $I = (I_1, I_2)$ is a multi-index and $\{\phi_k\}_{k=1}^{N}$ is a basis for V. The tensor A is a sparse rank two tensor, a matrix, of dimensions $N \times N$. The matrix A is computed by iterating over the cells of the mesh and adding the contribution from each local cell to the global matrix A. In this case, from (1.3), the local cell tensor $A_T$ is defined as:

$A_{T,i} = a_T(\phi^T_{i_2}, \phi^T_{i_1}) = \int_T \nabla u \cdot \nabla v \, \mathrm{d}x$,    (1.10)

where $i = (i_1, i_2)$ is a multi-index, $A_{T,i}$ is the ith entry of the cell tensor $A_T$, $a_T$ is the local contribution to the form from a cell $T \in \mathcal{T}_h$ and $\{\phi^T_k\}_{k=1}^{3}$ is the local finite element basis for V on T, which is linear Lagrange elements on triangles in this case.

To formulate the assembly algorithm, a local-to-global mapping of degrees of freedom is needed. Let $\iota_T : \mathcal{I}_T \to \mathcal{I}$ denote the collective local-to-global mapping for each $T \in \mathcal{T}_h$

$\iota_T(i) = (\iota_T^1(i_1), \iota_T^2(i_2)) \quad \forall i \in \mathcal{I}_T$,    (1.11)

where $\iota_T^j : [1, 3] \to [1, N]$ denotes the local-to-global mapping for each discrete function space $V_j$ and $\mathcal{I}_T$ is the index set

$\mathcal{I}_T = \prod_{j=1}^{2} [1, 3] = \{(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)\}$.    (1.12)
That is, $\iota_T$ maps a tuple of local degrees of freedom to a tuple of global degrees of freedom. DOLFIN calls the tabulate_tensor and tabulate_dofs functions presented in Section 1.3.4, in order to compute the local contribution $a_T$ and the local-to-global mapping $\iota_T^j$ for each discrete function space, from which DOLFIN constructs the collective local-to-global mapping $\iota_T$.
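The construction of the collective mapping in (1.11) from the index set in (1.12) amounts to a Cartesian product. The sketch below is a plain Python illustration with a hypothetical cell whose three local degrees of freedom map to the global degrees of freedom 2, 3 and 4; the function name and the data are assumptions, and 1-based indices are used to match the equations.

```python
from itertools import product

def collective_map(dofs_1, dofs_2):
    """Build iota_T as a dictionary from local index pairs to global index pairs.

    dofs_1 and dofs_2 play the roles of the per-space maps in (1.11),
    stored as lists so that local index k (1-based) maps to dofs[k - 1]."""
    local = range(1, 4)                      # the interval [1, 3]
    return {(i1, i2): (dofs_1[i1 - 1], dofs_2[i2 - 1])
            for i1, i2 in product(local, local)}

# Hypothetical local-to-global map of one cell (same space for both arguments)
dofs = [2, 3, 4]
iota_T = collective_map(dofs, dofs)

# The keys reproduce the index set of (1.12)
index_set = sorted(iota_T)
```

Each of the nine local entries of the cell tensor is thereby assigned one position in the global matrix, for instance the local pair (1, 3) is sent to the global pair (2, 4).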
The assembly of the matrix A can now be carried out efficiently by iterating over all cells $T \in \mathcal{T}_h$. On each cell T, the cell tensor $A_T$ is computed and then added to the global tensor A as outlined in Algorithm 1. The algorithm can be extended to handle assembly over exterior and interior facets; the latter is demonstrated in Section 4.1.4.

Algorithm 1 Assembly algorithm.
A = 0
for T ∈ T_h
    (1) Compute ι_T
    (2) Compute A_T
    (3) Add A_T to A according to ι_T:
    for i ∈ I_T
        A_{ι_T(i)} += A_{T,i}
    end for
end for
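Algorithm 1 can be sketched in plain Python for the Poisson bilinear form with linear Lagrange elements. This is an illustration under simplifying assumptions and not DOLFIN's implementation: the global matrix is stored densely as nested lists, indices are 0-based, the degrees of freedom are taken to coincide with the global vertex numbers, and the function names and the two-triangle mesh of the unit square are hypothetical.

```python
def cell_tensor(p0, p1, p2):
    """Local stiffness matrix for linear Lagrange elements on the triangle (p0, p1, p2)."""
    det = (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])
    # Gradients of the three basis functions are constant on the cell
    grads = [((p1[1] - p2[1]) / det, (p2[0] - p1[0]) / det),
             ((p2[1] - p0[1]) / det, (p0[0] - p2[0]) / det),
             ((p0[1] - p1[1]) / det, (p1[0] - p0[0]) / det)]
    area = 0.5 * abs(det)
    return [[area * (gi[0] * gj[0] + gi[1] * gj[1]) for gj in grads]
            for gi in grads]

def assemble(vertices, cells):
    """Add the cell tensor of every cell into a dense global matrix."""
    n = len(vertices)
    A = [[0.0] * n for _ in range(n)]
    for cell in cells:
        dofs = list(cell)                                    # (1) compute iota_T
        A_T = cell_tensor(*(vertices[v] for v in cell))      # (2) compute A_T
        for i1 in range(3):                                  # (3) add A_T to A
            for i2 in range(3):
                A[dofs[i1]][dofs[i2]] += A_T[i1][i2]
    return A

# Unit square split into two triangles along the diagonal between vertices 1 and 2
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cells = [(0, 1, 2), (1, 2, 3)]
A = assemble(vertices, cells)
```

The entries associated with the shared vertices 1 and 2 receive contributions from both cells, and every row sums to zero because constant functions have zero gradient.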
2 FEniCS applications to solid mechanics
One of the goals of this work is to tackle complicated solid mechanics models using automated modelling tools. In the previous chapter it was shown how automated modelling could be employed to solve the finite element variational formulation of a Poisson boundary value problem. The Poisson problem provides a simple platform for introducing the concepts behind automated modelling as it is implemented in the FEniCS framework. However, from the simple presentation it may not be immediately clear how more complex problems, like plasticity, can be solved. A natural step is, therefore, to apply the concept of automated modelling to some standard solid mechanics problems. Solid mechanics problems typically involve the standard momentum balance equation, posed in a Lagrangian setting, with different models distinguished by the choice of nonlinear or linearised kinematics, and the constitutive model for determining the stress. The traditional development approach to solid mechanics problems, and traditional finite element codes, places a strong emphasis on the implementation of constitutive models at the quadrature point level. Automated methods, on the other hand, tend to stress more heavily the governing balance equations. Widely used finite element codes for solid mechanics applications provide application programming interfaces (APIs) for users to implement their own constitutive models. The interface supplies kinematic and history data, and the user code computes the stress tensor, and when required also the linearisation of the stress. Users of such libraries will typically not be exposed to code development other than via the constitutive model API. In addition to demonstrating how solid mechanics problems can be solved using automation tools, this chapter presents some of the models that will be further investigated and extended in subsequent chapters. 
It is not intended as a comprehensive treatment of solid mechanics problems, but should be viewed as a stepping stone towards implementation of classes of plasticity models. The chapter also focuses on some pertinent issues that arise due to the nature of the constitutive models. These issues, and solid mechanics problems in general, have motivated a number of developments in the FEniCS framework. The common problems of linearised elasticity, plasticity, hyperelasticity and elastic wave propagation are considered. Topics that are addressed in this chapter via these problems include ‘off-line’ computation of stress updates, linearisation of problems with off-line stress updates, automatic differentiation and time stepping for problems with second-order time derivatives. The presentation starts with the relevant governing equations and the constitutive models under consideration. It then addresses the important issue of solving and linearising problems in which the governing equation is expressed in terms of the stress tensor (rather than explicitly in terms of the displacement field, or derivatives of the displacement field), and the stress tensor is computed via a separate algorithm. These topics are followed by a number of examples that demonstrate implementation approaches for the described models. The chapter is primarily based on the work in Ølgaard and Wells (2012a) and Ølgaard et al. (2008b). It concludes with a summary of extensions of the FEniCS framework that are particularly interesting with respect to solid mechanics problems, and consequently to this work.
2.1 Governing equations
2.1.1 Preliminaries
The considered problems will be posed on a polygonal domain $\Omega \subset \mathbb{R}^d$, where $1 \le d \le 3$. The boundary of $\Omega$, denoted by $\partial\Omega$, is decomposed into regions $\Gamma_D$ and $\Gamma_N$ such that $\Gamma_D \cup \Gamma_N = \partial\Omega$ and $\Gamma_D \cap \Gamma_N = \emptyset$. The outward unit normal vector on $\partial\Omega$ will be denoted by $n$. For time-dependent problems, a time interval of interest $I = (0, T]$ will be considered, where superimposed dots denote time derivatives. The current configuration of a solid body is denoted by $\Omega$; that is, the domain $\Omega$ depends on the displacement field. It is sometimes convenient to also define a reference domain $\Omega_0 \subset \mathbb{R}^d$ that remains fixed. For convenience, cases in which $\Omega$ and $\Omega_0$ coincide at time $t = 0$ are considered. To indicate boundaries, outward unit normal vectors, and other quantities relative to $\Omega_0$, the subscript ‘0’ will be used. When considering linearised kinematics, the domains $\Omega$ and $\Omega_0$ are both fixed and coincide at all times $t$. A triangulation of the domain $\Omega$ will be denoted by $\mathcal{T}_h$, and a triangulation of the domain $\Omega_0$ will be denoted by $\mathcal{T}_{h,0}$.

The governing equations for the different models will be formulated in the common framework of: find $u \in V$ such that

$$F(u; w) = 0 \quad \forall\, w \in V, \tag{2.1}$$

where $F : V \times V \to \mathbb{R}$ is linear in $w$ and $V$ is a suitable function space. If $F$ is also linear in $u$, then $F$ can be expressed as

$$F(u; w) := a(u, w) - L(w), \tag{2.2}$$

where $a : V \times V \to \mathbb{R}$ is linear in $u$ and in $w$, and $L : V \to \mathbb{R}$ is linear in $w$. For this case, the problem can be cast in the canonical setting of: find $u \in V$ such that

$$a(u, w) = L(w) \quad \forall\, w \in V, \tag{2.3}$$

which is identical to the form in (1.2). For nonlinear problems, a Newton method is typically employed to solve (2.1). Linearising $F$ about $u = u_0$ leads to a bilinear form,

$$a(du, w) := D_{u_0} F(u_0; w)[du] = \left. \frac{dF(u_0 + \epsilon\, du; w)}{d\epsilon} \right|_{\epsilon = 0}, \tag{2.4}$$

and a residual given by:

$$L(w) := F(u_0; w). \tag{2.5}$$

Using the definitions of $a$ and $L$ in (2.4) and (2.5), respectively, a Newton step involves solving a problem of the type in (2.3), followed by the correction $u_0 \leftarrow u_0 - du$. The process is repeated until (2.1) is satisfied to within a specified tolerance.
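The Newton procedure described above can be illustrated on a scalar analogue. The following is a minimal sketch under stated assumptions, not code from the FEniCS framework: the helper `newton` and the cubic residual are hypothetical and chosen only for illustration. It follows the sign convention of (2.4) and (2.5), in which the linear problem $a\,du = L$ is solved and the iterate is corrected by $u_0 \leftarrow u_0 - du$.

```python
def newton(F, dF, u0, tol=1e-12, max_iter=20):
    """Solve F(u) = 0 for a scalar unknown using the update u0 <- u0 - du."""
    history = []
    for _ in range(max_iter):
        L = F(u0)              # residual, cf. (2.5)
        history.append(abs(L))
        if abs(L) < tol:
            break
        a = dF(u0)             # linearisation about u0, cf. (2.4)
        du = L / a             # solve a*du = L, cf. (2.3)
        u0 = u0 - du           # correction step
    return u0, history

# Hypothetical scalar residual F(u) = u**3 - 2 (an assumption for illustration)
root, residuals = newton(lambda u: u**3 - 2.0, lambda u: 3.0*u**2, u0=1.0)
```

Because the derivative supplied here is the exact linearisation of the residual, the entries of `residuals` shrink roughly quadratically near the root, which is the behaviour that the later sections aim to preserve for problems with off-line stress updates.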
2.1.2 Balance of momentum
The standard balance of linear momentum problem for the body Ω reads:
$$\rho \ddot{u} - \nabla \cdot \sigma = b \quad \text{in } \Omega \times I, \tag{2.6}$$
$$u = g \quad \text{on } \Gamma_D \times I, \tag{2.7}$$
$$\sigma n = h \quad \text{on } \Gamma_N \times I, \tag{2.8}$$
$$u(x, 0) = u_0 \quad \text{in } \Omega, \tag{2.9}$$
$$\dot{u}(x, 0) = v_0 \quad \text{in } \Omega, \tag{2.10}$$

where $\rho : \Omega \times I \to \mathbb{R}$ is the mass density, $u : \Omega \times I \to \mathbb{R}^d$ is the displacement field, $\sigma : \Omega \times I \to \mathbb{R}^d \times \mathbb{R}^d$ is the symmetric Cauchy stress tensor, $b : \Omega \times I \to \mathbb{R}^d$ is a body force, $g : \Omega \times I \to \mathbb{R}^d$ is a prescribed boundary displacement, $h : \Omega \times I \to \mathbb{R}^d$ is a prescribed boundary traction, $u_0 : \Omega \to \mathbb{R}^d$ is the initial displacement and $v_0 : \Omega \to \mathbb{R}^d$ is the initial velocity. To complete the boundary value problem, a constitutive model that relates $\sigma$ to $u$ is required.

To develop finite element models, it is necessary to cast the momentum balance equation in a weak form by multiplying the balance equation (2.6) by a weight function $w$ and integrating. It is possible to formulate a space-time method by considering a weight function that depends on space and time, and then integrating over $\Omega \times I$. However, it is far more common in solid mechanics applications to consider a weight function that depends on spatial position only and to apply finite difference methods to deal with time derivatives. Following this approach, at a time $t \in I$ equation (2.6) is multiplied by a function $w$ ($w$ is assumed to satisfy $w = 0$ on $\Gamma_D$) and integrated over $\Omega$:

$$\int_\Omega \rho \ddot{u} \cdot w \, dx - \int_\Omega (\nabla \cdot \sigma) \cdot w \, dx - \int_\Omega b \cdot w \, dx = 0. \tag{2.11}$$

Applying integration by parts, using the divergence theorem and inserting the boundary condition (2.8), equation (2.11) can be expressed in the form of (2.2) as:

$$F := \int_\Omega \rho \ddot{u} \cdot w \, dx + \int_\Omega \sigma : \nabla w \, dx - \int_{\Gamma_N} h \cdot w \, ds - \int_\Omega b \cdot w \, dx = 0. \tag{2.12}$$
In this section, the momentum balance equation has been presented on the current configuration Ω. It can also be posed on the fixed reference domain Ω0 via a pull-back operation. However, for the particular presentation which is used in this chapter for geometrically nonlinear models details of the pull-back will not be needed.
2.1.3 Potential energy minimisation

An alternative approach to solving static problems (problems without an inertia term) is to consider the minimisation of potential energy. This approach leads to the same governing equation when applied to a standard problem, but may be a preferable framework for problems that are naturally posed in terms of stored energy densities and for which external forcing terms are conservative (see Holzapfel (2000, p. 159) for an explanation of conservative loading), and for problems that involve coupled physical phenomena that are best described energetically. Consider a system for which the total potential energy $\Pi$ associated with a body can be expressed as

$$\Pi = \Pi_{\mathrm{int}} + \Pi_{\mathrm{ext}}, \tag{2.13}$$

where $\Pi_{\mathrm{int}}$ is the internal potential energy stored in $\Omega$ and $\Pi_{\mathrm{ext}}$ is the energy associated with external forces acting on the domain $\Omega$. An internal potential energy functional of the form

$$\Pi_{\mathrm{int}} = \int_{\Omega_0} \Psi_0(v) \, dx, \tag{2.14}$$

where $\Psi_0$ is the stored strain energy density on the reference domain, and an external potential energy functional of the form

$$\Pi_{\mathrm{ext}} = -\int_{\Omega_0} b_0 \cdot v \, dx - \int_{\Gamma_{0,N}} h_0 \cdot v \, ds, \tag{2.15}$$

are considered. It is the form of the stored energy density function $\Psi_0$ that defines a particular constitutive model. For later convenience, the potential energy terms have been presented on the reference domain $\Omega_0$. A minimiser $u$ of (2.13) minimises the potential energy:

$$\min_{v \in V} \Pi, \tag{2.16}$$

where $V$ is a suitably defined function space. Minimisation of $\Pi$ corresponds to the directional derivative of $\Pi$ being zero for all possible variations of $u$. Therefore, minimisation of $\Pi$ corresponds to solving (2.1) with

$$F(u; w) := D_u \Pi(u)[w] = \left. \frac{d\Pi(u + \epsilon w)}{d\epsilon} \right|_{\epsilon = 0}. \tag{2.17}$$

For suitable definitions of the stress tensor, it is straightforward to show that minimising $\Pi$ is equivalent to solving the balance of momentum problem, for the static case.
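The link between minimisation and the directional derivative in (2.17) can be checked numerically on a toy problem. The sketch below assumes a hypothetical one-degree-of-freedom energy $\Pi(u) = \tfrac{1}{2} k u^2 - f u$ (the names `potential_energy` and `directional_derivative` and the values of `k` and `f` are illustrative assumptions) and approximates the derivative with respect to $\epsilon$ by a central difference.

```python
def potential_energy(u, k=4.0, f=2.0):
    """Hypothetical scalar energy: stored energy minus the work of the load."""
    return 0.5*k*u**2 - f*u

def directional_derivative(Pi, u, w, eps=1e-6):
    """Central-difference approximation of d/d(eps) Pi(u + eps*w) at eps = 0, cf. (2.17)."""
    return (Pi(u + eps*w) - Pi(u - eps*w)) / (2.0*eps)

u_min = 2.0/4.0   # analytical minimiser f/k of the assumed energy
F_at_min = directional_derivative(potential_energy, u_min, w=1.0)
F_away = directional_derivative(potential_energy, 1.0, w=1.0)
```

At the minimiser the directional derivative vanishes to round-off, whereas away from it the value agrees with the analytical residual $(k u - f)\,w$.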
2.2 Constitutive models
A constitutive model describes the relationship between stress and deformation. The stress can be defined explicitly in terms of primal functions, like the displacement field for linearised elasticity; it can be defined implicitly via stored energy density functions; or it can be defined as the solution to a secondary problem, for instance the yield criterion in the case of plasticity. The constitutive model can be either linear or nonlinear. In the following sections, examples of these cases are presented in the form of linearised elasticity, plasticity and hyperelasticity. The expressions for the stress or stored energy density presented in this section can be inserted into the balance equations or the minimisation framework in the preceding section to yield a governing equation.
2.2.1 Linearised elasticity
For linearised elasticity, the stress tensor as a function of the strain tensor for an isotropic, homogeneous material is given by
$$\sigma = 2\mu\varepsilon + \lambda \operatorname{tr}(\varepsilon) I, \tag{2.18}$$

where $\varepsilon = \left( \nabla u + (\nabla u)^T \right)/2$ is the strain tensor, $\mu$ and $\lambda$ are the Lamé parameters and $I$ is the second-order identity tensor. The relationship between the stress and the strain can also be expressed as

$$\sigma = \mathcal{C} : \varepsilon, \tag{2.19}$$

where

$$\mathcal{C}_{ijkl} = \mu \left( \delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk} \right) + \lambda \delta_{ij}\delta_{kl}, \tag{2.20}$$

and $\delta_{ij}$ is the Kronecker delta.
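The equivalence of (2.18) and (2.19)–(2.20) can be verified with a few lines of plain Python. This is a minimal sketch: the function names, the material parameters and the displacement gradient are assumptions made for illustration and do not come from the FEniCS libraries.

```python
def delta(i, j):
    """Kronecker delta."""
    return 1.0 if i == j else 0.0

def strain(grad_u):
    """Symmetric part of a 3x3 displacement gradient (nested lists)."""
    return [[0.5*(grad_u[i][j] + grad_u[j][i]) for j in range(3)] for i in range(3)]

def stress_lame(eps, mu, lmbda):
    """sigma = 2 mu eps + lambda tr(eps) I, cf. (2.18)."""
    tr = sum(eps[i][i] for i in range(3))
    return [[2.0*mu*eps[i][j] + lmbda*tr*delta(i, j) for j in range(3)] for i in range(3)]

def stress_tensor_form(eps, mu, lmbda):
    """sigma_ij = C_ijkl eps_kl with C_ijkl from (2.20)."""
    def C(i, j, k, l):
        return (mu*(delta(i, k)*delta(j, l) + delta(i, l)*delta(j, k))
                + lmbda*delta(i, j)*delta(k, l))
    return [[sum(C(i, j, k, l)*eps[k][l] for k in range(3) for l in range(3))
             for j in range(3)] for i in range(3)]

# Illustrative (assumed) material data and displacement gradient
E, nu = 10.0, 0.3
mu, lmbda = E/(2.0*(1.0 + nu)), E*nu/((1.0 + nu)*(1.0 - 2.0*nu))
grad_u = [[0.01, 0.02, 0.0], [0.0, -0.005, 0.003], [0.004, 0.0, 0.002]]
eps = strain(grad_u)
sigma_a = stress_lame(eps, mu, lmbda)
sigma_b = stress_tensor_form(eps, mu, lmbda)
```

Both routes give the same symmetric stress tensor; the first is the cheaper one, while the second makes the fourth-order structure of $\mathcal{C}$ explicit.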
2.2.2 Flow theory of plasticity

The standard flow theory model of plasticity is considered, and only the background necessary to support the examples will be presented. In-depth coverage can be found in many textbooks, such as Lubliner (2008) and Simo and Hughes (1998). For a geometrically linear plasticity problem, the stress tensor is computed by

$$\sigma = \mathcal{C} : \varepsilon^e, \tag{2.21}$$

where $\varepsilon^e$ is the elastic part of the strain tensor. It is assumed that the strain tensor can be decomposed additively into elastic and plastic parts:

$$\varepsilon = \varepsilon^e + \varepsilon^p. \tag{2.22}$$

If $\varepsilon^e$ can be determined, then the stress can be computed. The stress tensor in classical plasticity models must satisfy the yield criterion:

$$f(\sigma, \varepsilon^p, \kappa) := \phi\left(\sigma, q_{\mathrm{kin}}(\varepsilon^p)\right) - q_{\mathrm{iso}}(\kappa) - \sigma_y \le 0, \tag{2.23}$$

where $\phi\left(\sigma, q_{\mathrm{kin}}(\varepsilon^p)\right)$ is a scalar effective stress measure, $q_{\mathrm{kin}}$ is a stress-like internal variable used to model kinematic hardening, $q_{\mathrm{iso}}$ is a scalar stress-like term used to model isotropic hardening, $\kappa$ is a scalar internal variable and $\sigma_y$ is the initial scalar yield stress. For the commonly adopted von Mises model (also known as $J_2$-flow) with linear isotropic hardening, $\phi$ and $q_{\mathrm{iso}}$ read:

$$\phi(\sigma) = \sqrt{\tfrac{3}{2} s_{ij} s_{ij}}, \tag{2.24}$$
$$q_{\mathrm{iso}}(\kappa) = H\kappa, \tag{2.25}$$

where $s_{ij} = \sigma_{ij} - \sigma_{kk}\delta_{ij}/3$ is the deviatoric stress and the constant scalar $H > 0$ is a hardening parameter.

In the flow theory of plasticity, the plastic strain rate is given by:

$$\dot{\varepsilon}^p = \dot{\lambda} \frac{\partial g}{\partial \sigma}, \tag{2.26}$$

where $\dot{\lambda}$ is the rate of the plastic multiplier and the scalar $g$ is known as the plastic potential. In the case of associative plastic flow, $g = f$. The term $\dot{\lambda}$ determines the magnitude of the plastic strain rate, and the direction is given by $\partial g / \partial \sigma$. For isotropic strain-hardening, it is usual to set

$$\dot{\kappa} = \sqrt{\tfrac{2}{3} \dot{\varepsilon}^p_{ij} \dot{\varepsilon}^p_{ij}}, \tag{2.27}$$

which for associative von Mises plasticity implies that $\dot{\kappa} = \dot{\lambda}$.

A feature of the flow theory of plasticity is that the constitutive model is postulated in a rate form. This requires the application of algorithms to compute the stress from increments of the total strain. A discussion of algorithmic aspects on how the stress tensor can be computed from the equations presented in this section is postponed to Section 2.4.2.
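To make the yield criterion concrete, the sketch below evaluates the von Mises effective stress (2.24) and the yield function (2.23) with linear isotropic hardening (2.25) and no kinematic hardening. It is a plain-Python illustration under stated assumptions; the function names and the numerical values are hypothetical and are not part of the FEniCS Solid Mechanics library.

```python
import math

def deviator(sigma):
    """s_ij = sigma_ij - sigma_kk delta_ij / 3 for a 3x3 nested list."""
    p = sum(sigma[i][i] for i in range(3)) / 3.0
    return [[sigma[i][j] - (p if i == j else 0.0) for j in range(3)] for i in range(3)]

def von_mises(sigma):
    """Effective stress phi(sigma) = sqrt(3/2 s_ij s_ij), cf. (2.24)."""
    s = deviator(sigma)
    return math.sqrt(1.5 * sum(s[i][j]**2 for i in range(3) for j in range(3)))

def yield_function(sigma, kappa, sigma_y, H):
    """f = phi(sigma) - H*kappa - sigma_y, cf. (2.23) and (2.25)."""
    return von_mises(sigma) - H*kappa - sigma_y

# Assumed stress states: uniaxial tension and pure hydrostatic pressure
uniaxial = [[300.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
hydrostatic = [[-50.0, 0.0, 0.0], [0.0, -50.0, 0.0], [0.0, 0.0, -50.0]]
f_virgin = yield_function(uniaxial, kappa=0.0, sigma_y=250.0, H=1000.0)
f_hardened = yield_function(uniaxial, kappa=0.1, sigma_y=250.0, H=1000.0)
```

For uniaxial stress the effective stress reduces to the magnitude of the axial stress, and a hydrostatic state gives zero effective stress. With the assumed data the virgin material violates the yield criterion (`f_virgin > 0`), while the hardened state is admissible (`f_hardened < 0`).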
2.2.3 Hyperelasticity
Hyperelastic models are characterised by the existence of a stored strain energy density function $\Psi_0$. The linearised model presented at the start of this section falls within the class of hyperelastic models. Assuming linearised kinematics, the stored energy function

$$\Psi_0 = \frac{\lambda}{2} (\operatorname{tr} \varepsilon)^2 + \mu\, \varepsilon : \varepsilon \tag{2.28}$$

corresponds to the linearised model in (2.18). It is straightforward to show that using this stored energy function in the potential energy minimisation approach in (2.17) leads to the same equation as inserting the stress from (2.18) into the weak momentum balance equation (2.12).

More generally, stored energy functions that correspond to nonlinear models can be defined. A wide range of stored energy functions for hyperelastic models have been presented and analysed in the literature (see, for example, Bonet and Wood (1997) for a selection). In order to present concrete examples, it is necessary to introduce some kinematics, and in particular strain measures. The Green–Lagrange strain tensor $E$ is defined in terms of the deformation gradient $F : \Omega_0 \times I \to \mathbb{R}^d \times \mathbb{R}^d$ and the right Cauchy–Green tensor $C : \Omega_0 \times I \to \mathbb{R}^d \times \mathbb{R}^d_{\mathrm{sym}}$:

$$F = I + \nabla u, \tag{2.29}$$
$$C = F^T F, \tag{2.30}$$
$$E = \frac{1}{2} (C - I), \tag{2.31}$$

where $I$ is the second-order identity tensor. Using $E$ in place of the infinitesimal strain tensor $\varepsilon$ in (2.28), the following expression for the strain energy density function is obtained:

$$\Psi_0 = \frac{\lambda}{2} (\operatorname{tr} E)^2 + \mu\, E : E, \tag{2.32}$$

which is known as the St. Venant–Kirchhoff model. Unlike the linearised case, this energy density function is not quadratic in $u$ (or spatial derivatives of $u$), which means that when minimising the total potential energy $\Pi$, the resulting equations are nonlinear. Other examples of hyperelastic models are the Mooney–Rivlin model:

$$\Psi_0 = c_1 (I_C - 3) + c_2 (II_C - 3), \tag{2.33}$$

where $I_C = \operatorname{tr} C$ and $II_C = \frac{1}{2} \left( I_C^2 - \operatorname{tr}(C^2) \right)$, and the compressible neo-Hookean model:

$$\Psi_0 = \frac{\mu}{2} (I_C - 3) - \mu \ln J + \frac{\lambda}{2} (\ln J)^2, \tag{2.34}$$

where $J = \det F$.

In most presentations of hyperelastic models, one would proceed from the definition of the stored energy function to the derivation of a stress tensor, and then often to a linearisation of the stress for use in a Newton method. This process can be lengthy and tedious. For a range of models, features of UFL will permit problems to be posed as energy minimisation problems, and it will not be necessary to compute an expression for a stress tensor, or its linearisation, explicitly. A particular model can then be posed in terms of a particular expression for $\Psi_0$, as will be demonstrated in the example in Section 2.4.3. It is also possible to follow the momentum balance route, in which case UFL can be used to compute the stress tensor and its linearisation automatically from an expression for $\Psi_0$.
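The kinematic quantities (2.29)–(2.31) and the stored energy densities (2.32) and (2.34) are simple to evaluate point-wise. The following plain-Python sketch does so for a given displacement gradient; it is an illustration only, with hypothetical function names and assumed parameter values, and is unrelated to the UFL route used in Section 2.4.3.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k]*B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def det3(A):
    return (A[0][0]*(A[1][1]*A[2][2] - A[1][2]*A[2][1])
            - A[0][1]*(A[1][0]*A[2][2] - A[1][2]*A[2][0])
            + A[0][2]*(A[1][0]*A[2][1] - A[1][1]*A[2][0]))

def kinematics(grad_u):
    """Return F, C and E according to (2.29)-(2.31)."""
    F = [[grad_u[i][j] + (1.0 if i == j else 0.0) for j in range(3)] for i in range(3)]
    C = matmul(transpose(F), F)
    E = [[0.5*(C[i][j] - (1.0 if i == j else 0.0)) for j in range(3)] for i in range(3)]
    return F, C, E

def psi_st_venant_kirchhoff(grad_u, mu, lmbda):
    """Stored energy density (2.32)."""
    _, _, E = kinematics(grad_u)
    tr_E = sum(E[i][i] for i in range(3))
    return 0.5*lmbda*tr_E**2 + mu*sum(E[i][j]**2 for i in range(3) for j in range(3))

def psi_neo_hookean(grad_u, mu, lmbda):
    """Stored energy density (2.34)."""
    F, C, _ = kinematics(grad_u)
    I_C = sum(C[i][i] for i in range(3))
    J = det3(F)
    return 0.5*mu*(I_C - 3.0) - mu*math.log(J) + 0.5*lmbda*math.log(J)**2

# Assumed data: Lame parameters and a small uniaxial stretch h
mu, lmbda, h = 1.0, 2.0, 1.0e-3
zero = [[0.0]*3 for _ in range(3)]
stretch = [[h, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
psi_linear = 0.5*lmbda*h**2 + mu*h**2   # linearised energy (2.28) for the same state
```

Both energies vanish in the undeformed state, and for the small stretch assumed here they agree with the linearised energy (2.28) to within a fraction of a percent, which reflects that the nonlinear models reduce to linearised elasticity for small strains.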
2.3 Linearisation issues for complex constitutive models
Solving problems with nonlinear constitutive models, such as plasticity, using Newton’s method requires linearisation of (2.12). There are two particular issues that deserve attention. The first is that if the stress σ is computed via some algorithm, then proper linearisation of F requires linearisation of the algorithm for computing the stress, and not linearisation of the continuous problem. This point is well known in computational plasticity, and has been extensively studied (Simo and Taylor, 1985). The second issue is that the stress field, and its linearisation, will not in general come from a finite element space. Hence, if all functions are assumed to be in a finite element space, or are interpolated in a finite element space, suboptimal convergence of a Newton method will be observed. This is illustrated in the following sections.
2.3.1 Consistency of linearisation

Consider the following one-dimensional problem:

$$F(u; w) := \int_\Omega \sigma\, w_{,x} \, dx, \tag{2.35}$$

where the scalar stress $\sigma$ is a nonlinear function of the strain field $u_{,x}$, and will be computed via a separate algorithm. A continuous, piecewise quadratic displacement field (and likewise for $w$) is considered. The strain field $u_{,x}$ is computed via an $L^2$-projection onto the space of discontinuous, piecewise linear elements. For the considered spaces, this is equivalent to a direct evaluation of the strain. Because the stress is computed via a separate algorithm based on nodal values from the strain field, it is chosen to also represent the stress using a discontinuous, piecewise linear basis. Since the polynomial degree of the integrand is two, (2.35) can be integrated exactly using two Gauss quadrature points on an element $T \in \mathcal{T}_h$:

$$f_{T,i_1} := \sum_{q=1}^{2} \sum_{\alpha=1}^{2} \psi^T_\alpha(x_q)\, \sigma_\alpha\, \phi^T_{i_1,x}(x_q)\, W_q, \tag{2.36}$$

where $q$ is the integration point index, $\alpha$ is the degree of freedom index for the local basis of $\sigma$, $\psi^T$ and $\phi^T$ denote the linear and quadratic basis functions on the element $T$, respectively, and $W_q$ is the quadrature weight at integration point $x_q$. Note that $\sigma_\alpha$ is the computed value of the stress at the element node $\alpha$.

To apply a Newton method, the Jacobian (linearisation) of (2.36) is required. This will be denoted by $A^\star_{T,i}$. To achieve quadratic convergence of a Newton method, the linearisation must be exact. The Jacobian of (2.36) is:

$$A^\star_{T,i} := \frac{d f_{T,i_1}}{d u_{i_2}}, \tag{2.37}$$

where $u_{i_2}$ are the displacement degrees of freedom. Because the stress is computed from the strain field $u_{,x}$, only $\sigma_\alpha$ in (2.36) depends on $u_{i_2}$, and the linearisation of this term reads:

$$\frac{d\sigma_\alpha}{d u_{i_2}} = \frac{d\sigma_\alpha}{d\varepsilon_\alpha} \frac{d\varepsilon_\alpha}{d u_{i_2}} = D_\alpha \frac{d\varepsilon_\alpha}{d u_{i_2}}, \tag{2.38}$$

where $D_\alpha$ is the tangent value at node $\alpha$. To compute the values of the strain at nodes, $\varepsilon_\alpha$, from the displacement field, the derivative of the displacement field is evaluated at $x_\alpha$:

$$\varepsilon_\alpha = \sum_{i_2=1}^{3} \phi^T_{i_2,x}(x_\alpha)\, u_{i_2}. \tag{2.39}$$

Inserting (2.38) and (2.39) into (2.37) yields:

$$A^\star_{T,i} = \sum_{q=1}^{2} \sum_{\alpha=1}^{2} \psi^T_\alpha(x_q)\, D_\alpha\, \phi^T_{i_2,x}(x_\alpha)\, \phi^T_{i_1,x}(x_q)\, W_q. \tag{2.40}$$
This is the exact linearisation of (2.36).

The linearisation of the weak form (2.35) is now considered, which leads to the bilinear form:

$$a(u, w) := \int_\Omega D\, u_{,x}\, w_{,x} \, dx, \tag{2.41}$$

where $D = d\sigma / d\varepsilon$ is the tangent. As before, $D$ is represented using a discontinuous, piecewise linear basis where the nodal values of $D$ are computed via a separate algorithm. If two quadrature points are used to integrate the form (which is exact for this form), the resulting element matrix is:

$$A_{T,i} = \sum_{q=1}^{2} \sum_{\alpha=1}^{2} \psi^T_\alpha(x_q)\, D_\alpha\, \phi^T_{i_2,x}(x_q)\, \phi^T_{i_1,x}(x_q)\, W_q. \tag{2.42}$$

The representation of the element matrix in (2.42) is what would be produced by FFC. Equations (2.40) and (2.42) are not identical since $\phi^T_{i_2,x}$ is being evaluated in different locations ($x_q \neq x_\alpha$ in general). As a consequence, the bilinear form in (2.42) is not an exact linearisation of (2.35), and a Newton method will therefore exhibit suboptimal convergence. For the special case where a continuous, piecewise linear basis is used for $u$ and $w$ and a discontinuous, piecewise constant basis is used for the strain, stress and tangent fields, only one integration point is needed and thus $x_q = x_\alpha$, which makes the linearisation exact.

In general, the illustrated problem arises when some coefficients in a form are computed by a nonlinear operation elsewhere, and then interpolated and evaluated at points that differ from where the coefficient values were computed. This situation is different from the use of nonlinear operators in UFL (see Table 1.2, page 10). An example of such an operator is the ‘ln J’ term in the neo-Hookean model (2.34) where ‘J’ will be computed at quadrature points during assembly, after which the operator ‘ln’ is applied to compute ‘ln J’. The linearisation issue highlighted in this section is further illustrated in the following section, as too is a solution in the context of automated modelling that involves the definition of so-called ‘quadrature elements’.
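The difference between (2.40) and (2.42) can be made tangible by evaluating both element matrices on a single reference element. The sketch below is a plain-Python illustration under stated assumptions: the element is taken as $T = [0, 1]$, the quadratic basis has nodes at 0, 0.5 and 1, the discontinuous linear basis has nodes at 0 and 1, and the nodal tangent values `D` are arbitrary. None of the names come from FFC.

```python
import math

def dphi(i, x):
    """x-derivatives of the quadratic Lagrange basis with nodes 0, 0.5, 1."""
    return (4.0*x - 3.0, 4.0 - 8.0*x, 4.0*x - 1.0)[i]

def psi(alpha, x):
    """Discontinuous linear basis with nodes 0 and 1."""
    return (1.0 - x, x)[alpha]

x_alpha = (0.0, 1.0)                                   # nodes of the stress/tangent basis
x_q = (0.5 - 0.5/math.sqrt(3.0), 0.5 + 0.5/math.sqrt(3.0))  # two Gauss points on [0, 1]
W_q = (0.5, 0.5)                                       # Gauss weights on [0, 1]

def element_matrix(D, exact):
    """exact=True evaluates (2.40); exact=False evaluates (2.42)."""
    A = [[0.0]*3 for _ in range(3)]
    for i1 in range(3):
        for i2 in range(3):
            for q in range(2):
                for a in range(2):
                    # The only difference: where the trial derivative is evaluated
                    x_eval = x_alpha[a] if exact else x_q[q]
                    A[i1][i2] += (psi(a, x_q[q])*D[a]*dphi(i2, x_eval)
                                  *dphi(i1, x_q[q])*W_q[q])
    return A

def max_difference(D):
    A_star, A = element_matrix(D, True), element_matrix(D, False)
    return max(abs(A_star[i][j] - A[i][j]) for i in range(3) for j in range(3))

gap_varying = max_difference((1.0, 2.0))    # tangent differs between the two nodes
gap_constant = max_difference((1.5, 1.5))   # tangent is constant on the element
```

With the assumed data, the two matrices differ when the tangent varies over the element and coincide to round-off when it is constant. This is consistent with the observation that the inconsistency only matters once the off-line computation is genuinely nonlinear.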
2.3.2 Quadrature elements

Before introducing the concept of quadrature elements, a model problem that will be used in numerical examples is presented. Given the finite element space

$$V := \left\{ w \in H^1_0(\Omega),\ w \in P_k(T)\ \forall\, T \in \mathcal{T}_h \right\}, \tag{2.43}$$

where $\Omega \subset \mathbb{R}$ and $k \ge 1$, the model problem of interest involves: given $f \in V$, find $u \in V$ such that

$$F := \int_\Omega \left( 1 + u^2 \right) u_{,x} w_{,x} \, dx - \int_\Omega f w \, dx = 0 \quad \forall\, w \in V. \tag{2.44}$$

Solving this problem via Newton’s method involves solving a series of linear problems with

$$L(w) := \int_\Omega \left( 1 + u_n^2 \right) u_{n,x} w_{,x} \, dx - \int_\Omega f w \, dx, \tag{2.45}$$
$$a(du_{n+1}, w) := \int_\Omega \left( 1 + u_n^2 \right) du_{n+1,x} w_{,x} \, dx + \int_\Omega 2 u_n u_{n,x}\, du_{n+1}\, w_{,x} \, dx, \tag{2.46}$$

with the update $u_n \leftarrow u_n - du_{n+1}$. To draw an analogy with complex constitutive models, the above is rephrased as:

$$L(w) := \int_\Omega \sigma_n w_{,x} \, dx - \int_\Omega f w \, dx, \tag{2.47}$$
$$a(du_{n+1}, w) := \int_\Omega C_n\, du_{n+1,x} w_{,x} \, dx + \int_\Omega 2 u_n u_{n,x}\, du_{n+1}\, w_{,x} \, dx, \tag{2.48}$$

where $\sigma_n = \left( 1 + u_n^2 \right) u_{n,x}$ and $C_n = 1 + u_n^2$. Apart from the second term in the bilinear form, the forms now resemble those for a plasticity problem where $\sigma$ is the ‘stress’, $C$ is the ‘tangent’ and $u_{,x}$ is the ‘strain’. Similar to a plasticity problem, the idea is to compute nodal values of $\sigma$ and $C$ ‘off-line’, and to supply $\sigma$ and $C$ as functions in a space $W$ to the forms used in the Newton solution process. To access $u_{n,x}$ for use off-line, an approach is to perform an $L^2$-projection of the derivative of $u$ onto a space $W$. For the example in question, the term $1 + u^2$ will also be projected onto $W$. A natural choice would be to make $W$ one polynomial order less than $V$ and discontinuous across cell facets. However, following this approach leads to a convergence rate for a Newton solver that is less than the expected quadratic rate. The reason is that the linearisation that follows from this process is not consistent with the problem being solved, as explained in the previous section.

To resolve this issue within the context of UFL and FFC, the concept of quadrature elements has been developed.¹ This special type of element is used to represent ‘functions’ that can only be evaluated at particular points (quadrature points), and cannot be differentiated, but can be integrated (approximately). In the remainder of this section key features of the quadrature element are presented together with a demonstration of its use for the model problem considered above.

¹ The concept was introduced in Ølgaard et al. (2008b), although the syntax for declaring a ‘quadrature element’ and the underlying interpretation have changed slightly. Specifically, the argument k used to refer to the number of integration points in each spatial direction of the quadrature scheme, which is different from the current interpretation in which it refers to the polynomial degree that the underlying quadrature rule will be able to integrate exactly.

A quadrature element is declared in UFL by:

UFL code
    element = FiniteElement("Quadrature", tetrahedron, k)

where k is the polynomial degree that the underlying quadrature rule will be able to integrate exactly. The declaration of a quadrature element is similar to the declaration of any other element in UFL, as demonstrated in Section 1.3.2, and it can be used as such, with some limitations. Note, however, the subtle difference that the element order does not refer to the polynomial degree of the finite element shape functions, but instead relates to the quadrature scheme. For ‘sufficient’ integration of a second-order polynomial in three dimensions, FFC will use four quadrature points per cell. FFC interprets the quadrature points of the quadrature element as degrees of freedom where the value of a shape function for a degree of freedom is equal to one at the quadrature point and zero otherwise. This has the implication that a function that is defined on a quadrature element can only be evaluated at quadrature points. Furthermore, it is not possible to take derivatives of functions defined on a quadrature element.

The following examples illustrate simple usage of a quadrature element. Consider the bilinear form for a mass matrix weighted by a coefficient $f$ that is defined on a quadrature element:

$$a(u, w) := \int_\Omega f u w \, dx. \tag{2.49}$$

If the test and trial functions $w$ and $u$ come from a space of linear Lagrange functions, the polynomial degree of their product is two. This means that the coefficient $f$ should be defined as:
UFL code
    ElementQ = FiniteElement("Quadrature", tetrahedron, 2)
    f = Coefficient(ElementQ)

to ensure appropriate integration of the form in (2.49). The reason for this is that the quadrature element in the form dictates the quadrature scheme that FFC will use for the numerical integration since the quadrature element, as described above, only has nonzero values at points that coincide with the underlying quadrature scheme of the quadrature element. Thus, if the degree of ElementQ is set to one, the form will be integrated using only one integration point, since one point is enough to integrate a linear polynomial exactly, and as a result the form is under-integrated. If quadratic Lagrange elements are used for w and u, the polynomial degree of the integrand is four, therefore the declaration for the coefficient f should be changed to:

UFL code
    ElementQ = FiniteElement("Quadrature", tetrahedron, 4)
    f = Coefficient(ElementQ)
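The effect of choosing too low a quadrature degree can be reproduced in one dimension without any FEniCS machinery. The sketch below is an analogue only: it assumes Gauss–Legendre rules on the interval $[0, 1]$, for which $n$ points integrate polynomials of degree $2n - 1$ exactly, whereas FFC selects its own schemes on simplices (for instance the four points per tetrahedron mentioned above). The names `RULES`, `points_for_degree` and `integrate` are hypothetical.

```python
import math

# Gauss-Legendre rules on [0, 1], keyed by number of points: (points, weights)
RULES = {
    1: ((0.5,), (1.0,)),
    2: ((0.5 - 0.5/math.sqrt(3.0), 0.5 + 0.5/math.sqrt(3.0)), (0.5, 0.5)),
}

def points_for_degree(k):
    """Smallest number of 1D Gauss points n such that 2n - 1 >= k."""
    return (k + 2)//2

def integrate(g, k):
    """Integrate g over [0, 1] with the rule chosen for polynomial degree k (k <= 3 here)."""
    pts, wts = RULES[points_for_degree(k)]
    return sum(w*g(x) for x, w in zip(pts, wts))

# Integrand u*w with u = w = x, i.e. a product of two linear functions (degree two)
mass_exact = 1.0/3.0
mass_degree_2 = integrate(lambda x: x*x, 2)   # rule chosen for degree two
mass_degree_1 = integrate(lambda x: x*x, 1)   # rule chosen for degree one
```

Requesting degree two reproduces the exact value, while requesting degree one evaluates the integrand at the midpoint only and under-integrates the product, which mirrors the under-integration described for `ElementQ` of degree one.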
As a final demonstration of quadrature elements, consider the DOLFIN code in Figure 2.1 for solving the nonlinear model problem in (2.44) with a source term $f = x^2 - 4$, Dirichlet boundary conditions $u = 1$ at $x = 0$, continuous quadratic elements for $V$, and quadrature elements of degree two for $W$. NonlinearModelProblem is a subclass of the DOLFIN class NonlinearProblem, which implements the linear form F and the bilinear form J, the derivative or Jacobian of F, according to (2.5) and (2.4) respectively. The DOLFIN class NewtonSolver solves problems expressed in the canonical form of (2.1) based on the information provided by the NonlinearModelProblem object. Further details on the DOLFIN classes NonlinearProblem and NewtonSolver can be found in Logg et al. (2012d).

The relative residual norm after each iteration of the Newton solver for four different combinations of spaces $V$ and $W$ is shown in Table 2.1. Continuous, discontinuous and quadrature elements are denoted by CGk, DGk and Qk respectively, where k refers to the polynomial degree as discussed previously. It is clear from the table that using quadratic elements for $V$ requires the use of quadrature elements for $W$ in order to ensure quadratic convergence of the Newton solver.

Iteration   CG1/DG0     CG1/Q1      CG2/DG1     CG2/Q2
1           1.114e+00   1.101e+00   1.398e+00   1.388e+00
2           2.161e-01   2.319e-01   2.979e-01   2.691e-01
3           3.206e-03   3.908e-03   2.300e-02   6.119e-03
4           7.918e-07   7.843e-07   1.187e-03   1.490e-06
5           9.696e-14   3.662e-14   2.656e-05   1.242e-13
6                                   5.888e-07
7                                   1.317e-08
8                                   2.963e-10

Table 2.1: Computed relative residual norms after each iteration of the Newton solver for the nonlinear model problem using different elements for V and W. Quadratic convergence is observed when using quadrature elements, and when using piecewise constant functions for W, which coincides with a one-point quadrature element. The presented results are computed using the code in Figure 2.1 using the different combinations of function spaces.
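The convergence behaviour reported in Table 2.1 can be quantified by estimating the observed order $p$ in $r_{k+1} \approx C r_k^p$ from three consecutive residuals. The short sketch below does this for the two quadratic-element columns of the table; the helper name `observed_orders` is hypothetical and the estimate is a standard post-processing device rather than something computed by DOLFIN.

```python
import math

# Relative residual norms copied from Table 2.1
residuals = {
    "CG2/DG1": [1.398e+00, 2.979e-01, 2.300e-02, 1.187e-03,
                2.656e-05, 5.888e-07, 1.317e-08, 2.963e-10],
    "CG2/Q2": [1.388e+00, 2.691e-01, 6.119e-03, 1.490e-06, 1.242e-13],
}

def observed_orders(r):
    """Estimate p in r[k+1] ~ C*r[k]**p from each triple of consecutive residuals."""
    return [math.log(r[k + 1]/r[k]) / math.log(r[k]/r[k - 1])
            for k in range(1, len(r) - 1)]

orders = {name: observed_orders(r) for name, r in residuals.items()}
```

The final estimate for CG2/Q2 is close to two, whereas the estimates for CG2/DG1 settle near one, that is, the inconsistent linearisation degrades the Newton iteration to linear convergence.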
2.4 Implementations and examples
This section presents implementation examples that correspond to the models presented above. Where feasible, complete solvers are presented. When this is not feasible, relevant code extracts are presented. Python examples are preferred due to the compactness of the code extracts; in the case of plasticity, however, efficiency demands a C++ implementation. It is possible that an efficient Python interface for plasticity problems will be made available in the future via just-in-time compilation.
Python code

    from dolfin import *

    # Sub domain for Dirichlet boundary condition
    class DirichletBoundary(SubDomain):
        def inside(self, x, on_boundary):
            return x[0] < DOLFIN_EPS and on_boundary

    # Class for interfacing with the Newton solver
    class NonlinearModelProblem(NonlinearProblem):
        def __init__(self, a, L, u, C, S, W, bc):
            NonlinearProblem.__init__(self)
            self.a, self.L = a, L
            self.u, self.C, self.S, self.W, self.bc = u, C, S, W, bc

        # Assemble the residual vector
        def F(self, b, x):
            assemble(self.L, tensor=b)
            self.bc.apply(b, x)

        # Assemble the Jacobian matrix
        def J(self, A, x):
            assemble(self.a, tensor=A)
            self.bc.apply(A)

        # 'Off-line' update of the 'tangent' C and the 'stress' S
        def form(self, A, b, x):
            C = project((1.0 + self.u**2), self.W)
            self.C.vector()[:] = C.vector()
            S = project(Dx(self.u, 0), self.W)
            self.S.vector()[:] = S.vector()
            self.S.vector()[:] = self.S.vector()*self.C.vector()

    # Create mesh and define function spaces
    mesh = UnitInterval(8)
    V = FunctionSpace(mesh, "Lagrange", 2)
    W = FunctionSpace(mesh, "Quadrature", 2)

    # Define boundary condition
    bc = DirichletBC(V, Constant(1.0), DirichletBoundary())

    # Define source and functions
    f = Expression("x[0]*x[0] - 4")
    u, C, S = Function(V), Function(W), Function(W)

    # Define variational problems
    du, w = TrialFunction(V), TestFunction(V)
    L = S*Dx(w, 0)*dx - f*w*dx
    a = C*Dx(du, 0)*Dx(w, 0)*dx + 2*u*Dx(u, 0)*du*Dx(w, 0)*dx

    # Create nonlinear problem, solver and solve
    problem = NonlinearModelProblem(a, L, u, C, S, W, bc)
    solver = NewtonSolver()
    solver.solve(problem, u.vector())
Figure 2.1: DOLFIN implementation for the nonlinear model problem in (2.44) with ‘off-line’ computation of terms used in the variational forms.
In the code extracts, commentary is only provided for non-trivial aspects, as the more generic aspects, such as the creation of meshes, application of boundary conditions and the solution of linear systems, have already been treated in the introduction to the FEniCS Project in Section 1.3.
2.4.1 Linearised elasticity

This example is particularly simple since the stress can be expressed as a straightforward function of the displacement field, and the expression for the stress in (2.18) can be inserted directly into (2.12). For the steady case (inertia terms are ignored), a complete solver for a linearised elasticity problem is presented in Figure 2.2. The solver in Figure 2.2 is for a simulation on a unit cube with a source term b = (1, 0, 0) and u = 0 on ∂Ω. A continuous, piecewise quadratic finite element space is used. The expressiveness of the UFL input means that the expressions for sigma and F in Figure 2.2 resemble closely the mathematical expressions used in the text for σ and F. To unify the presentation of linear and nonlinear equations, the problem in Figure 2.2 is presented in terms of F, where the UFL functions lhs (left-hand side) and rhs (right-hand side) have been used to automatically extract the bilinear and linear forms, respectively, from F (Alnæs et al., 2013).
Python code

    from dolfin import *

    # Create mesh
    mesh = UnitCube(8, 8, 8)

    # Create function space
    V = VectorFunctionSpace(mesh, "Lagrange", 2)

    # Create test and trial functions, and source term
    u, w = TrialFunction(V), TestFunction(V)
    b = Constant((1.0, 0.0, 0.0))

    # Elasticity parameters
    E, nu = 10.0, 0.3
    mu, lmbda = E/(2.0*(1.0 + nu)), E*nu/((1.0 + nu)*(1.0 - 2.0*nu))

    # Stress
    sigma = 2*mu*sym(grad(u)) + lmbda*tr(grad(u))*Identity(w.cell().d)

    # Governing balance equation
    F = inner(sigma, grad(w))*dx - dot(b, w)*dx

    # Extract bilinear and linear forms from F
    a, L = lhs(F), rhs(F)

    # Dirichlet boundary condition on entire boundary
    c = Constant((0.0, 0.0, 0.0))
    bc = DirichletBC(V, c, DomainBoundary())

    # Set up PDE and solve
    u = Function(V)
    problem = LinearVariationalProblem(a, L, u, bcs=bc)
    solver = LinearVariationalSolver(problem)
    solver.parameters["symmetric"] = True
    solver.solve()

Figure 2.2: DOLFIN solver for a linearised elasticity problem on a unit cube.

2.4.2 Plasticity

The computation of the stress tensor, and its linearisation, for the model outlined in Section 2.2.2 in a displacement-driven finite element model is rather involved. A method of computing point-wise a stress tensor that satisfies (2.23) from the strain, strain increment and history variables is known as a ‘return mapping algorithm’. Return mapping strategies are discussed in detail in Simo and Hughes (1998). A widely used return mapping approach, the ‘closest-point projection’, is summarised below for a plasticity model with linear isotropic hardening.

From (2.21) and (2.22) the stress at the end of a strain increment reads:

$$\sigma_{n+1} = \mathcal{C} : \left( \varepsilon_{n+1} - \varepsilon^p_{n+1} \right). \tag{2.50}$$

Therefore, given $\varepsilon_{n+1}$, it is necessary to determine the plastic strain $\varepsilon^p_{n+1}$ in order to compute the stress. In a closest-point projection method the increment in plastic strain is computed from:

$$\varepsilon^p_{n+1} - \varepsilon^p_n = \Delta\lambda \frac{\partial g(\sigma_{n+1})}{\partial \sigma}, \tag{2.51}$$

where $g$ is the plastic potential function and $\Delta\lambda = \lambda_{n+1} - \lambda_n$. Since $\partial_\sigma g$ is evaluated at $\sigma_{n+1}$, (2.50) and (2.51) constitute a system of coupled equations with unknowns $\Delta\lambda$ and $\sigma_{n+1}$. In general, the system is nonlinear. To obtain a solution, Newton’s method is employed as follows, with $k$ denoting the iteration number. First, a ‘trial stress’ is computed:

$$\sigma_{\mathrm{trial}} = \mathcal{C} : \left( \varepsilon_{n+1} - \varepsilon^p_n \right). \tag{2.52}$$

Subtracting (2.52) from (2.50) and inserting (2.51), the following equation is obtained:

$$R_{n+1} := \sigma_{n+1} - \sigma_{\mathrm{trial}} + \Delta\lambda\, \mathcal{C} : \frac{\partial g(\sigma_{n+1})}{\partial \sigma} = 0, \tag{2.53}$$

where $R_{n+1}$ is the ‘stress residual’. During the Newton iterations this residual is driven towards zero. If the trial stress in (2.52) leads to satisfaction of the yield criterion in (2.23), then $\sigma_{\mathrm{trial}}$ is the new stress and the Newton procedure is terminated. Otherwise, the Newton increment of $\Delta\lambda$ is computed from:
fk Rk : Qk : ∂σ fk dλk = − , (2.54) ∂σ fk : Ξk : ∂σgk + h
h i 1 where Q = I + ∆λ : ∂2 g − , Ξ = Q : and h is a hardening parameter, which C σσ C for the von Mises model with linear hardening is equal to H (the constant hardening parameter). The stress increment is computed from: ∆σ = dλ : ∂ g R : Q , (2.55) k − kC σ k − k k after which the increment of the plastic multiplier and the stresses for the next iteration can be computed:
∆λk+1 = ∆λk + dλk, (2.56)
σk+1 = σk + ∆σk. (2.57)
The yield criterion is then evaluated again using the updated values, and the proce- dure continues until the yield criterion is satisfied to within a prescribed tolerance. Note that to start the procedure ∆λ0 = 0 and σ0 = σtrial. After convergence is achieved, the consistent tangent can be computed:
Ξ : ∂σg ∂σ f : Ξ Ctan = Ξ ⊗ , (2.58) − ∂σ f : Ξ : ∂σg + h which is used when assembling the global Jacobian (stiffness matrix). The return mapping algorithm is applied at all quadrature points. The closest-point return mapping algorithm described above (Simo and Hughes, 1998) is common to a range of plasticity models that are defined by the form of the functions f and g. The process can be generalised for models with more complicated hardening behaviour. To aid the implementation of different models, a return mapping algorithm and support for quadrature point level history parameters 46 Chapter 2. FEniCS applications to solid mechanics is provided by the FEniCS Solid Mechanics library. The library is implemented in C++ and adopts a polymorphic design, with the base class PlasticityModel providing an interface for users to implement, and thereby supply functions for f , ∂σ f , ∂σg, and ∂σσg. Figure 2.3 shows the public interface of the PlasticityModel class. Supplied with details of f (and possibly g), the library can compute stress updates and linearisations using the closest-point projection method. Computational efficiency is important in the return mapping algorithm as the stress and its linearisation are computed at all quadrature points at each global Newton iteration. Therefore, FEniCS Solid Mechanics relies on the linear algebra library Armadillo (http://arma.sourceforge.net/) to perform the block opera- tions inside the return mapping algorithm to get the benefit of BLAS. Furthermore, the algorithm is executed in C++ rather than in Python. For this reason, the FEniCS Solid Mechanics library provides a C++ interface only at this stage. To reconcile ease and efficiency, it would be possible to use just-in-time compilation for a Python implementation of the PlasticityModel interface, just as DOLFIN presently does for the Expression class (see Logg et al.(2012d)). 
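The Newton iteration in (2.52)-(2.58) can be illustrated with a minimal one-dimensional sketch. This is not the FEniCS Solid Mechanics implementation: the function name `return_map_1d`, the scalar yield function $f = |\sigma| - (\sigma_y + H\kappa)$, the associative choice $g = f$ and the update $\kappa_{n+1} = \kappa_n + \Delta\lambda$ are all assumptions made for illustration. In one dimension the tensors $\mathcal{C}$, $Q$ and $\Xi$ reduce to scalars.

```python
def return_map_1d(eps, eps_p_n, kappa_n, E, sigma_y, H, tol=1e-12, max_iter=25):
    """Closest-point projection for a scalar stress (illustrative sketch).

    Assumes f = |sigma| - (sigma_y + H*kappa), associative flow (g = f)
    and kappa_{n+1} = kappa_n + Delta lambda.
    Returns (stress, plastic strain, kappa, consistent tangent).
    """

    def f(sigma, kappa):
        return abs(sigma) - (sigma_y + H * kappa)

    # Trial stress, cf. (2.52)
    sigma_trial = E * (eps - eps_p_n)
    if f(sigma_trial, kappa_n) <= 0.0:
        # Elastic step: the trial stress is the new stress
        return sigma_trial, eps_p_n, kappa_n, E

    # Start values: Delta lambda_0 = 0 and sigma_0 = sigma_trial
    dlam, sigma = 0.0, sigma_trial
    for _ in range(max_iter):
        df = dg = 1.0 if sigma >= 0.0 else -1.0      # df/dsigma = dg/dsigma
        Q = 1.0                                      # d2g/dsigma2 = 0 here
        Xi = Q * E
        R = sigma - sigma_trial + dlam * E * dg      # residual, cf. (2.53)
        fk = f(sigma, kappa_n + dlam)
        if abs(fk) < tol and abs(R) < tol:
            break
        ddlam = (fk - R * Q * df) / (df * Xi * dg + H)   # cf. (2.54)
        dsigma = (-ddlam * E * dg - R) * Q               # cf. (2.55)
        dlam += ddlam                                    # cf. (2.56)
        sigma += dsigma                                  # cf. (2.57)
    else:
        raise RuntimeError("return mapping did not converge")

    # Consistent tangent, cf. (2.58)
    C_tan = Xi - (Xi * dg) * (df * Xi) / (df * Xi * dg + H)
    return sigma, eps_p_n + dlam * dg, kappa_n + dlam, C_tan


sigma, eps_p, kappa, C_tan = return_map_1d(
    eps=0.01, eps_p_n=0.0, kappa_n=0.0, E=200.0, sigma_y=1.0, H=20.0)
```

With linear hardening the scalar problem is linear in $\Delta\lambda$, so the loop converges after a single update, and the tangent reduces to $EH/(E + H)$.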
In the following, the outline of a solver based on the FEniCS Solid Mechanics library is presented. The UFL input for a formulation in three dimensions using a continuous, piecewise quadratic basis is shown in Figure 2.4. Note that the stress and the linearised tangent, s and t, are defined using quadrature elements and supplied as coefficients to the form (lines 2, 3 and 7), as they are computed inside the plasticity library. Note also in Figure 2.4 that symmetry has been exploited to flatten the stress and the tangent terms (lines 13 and 18). Recall from Section 2.3 that when constitutive updates are computed outside of the form file, care must be taken to ensure quadratic convergence of a Newton method. By using quadrature elements in Figure 2.4, it is possible to achieve quadratic convergence during a Newton solve for plasticity problems.

The solver is implemented in C++, and Figure 2.5 shows an extract of the most relevant parts of the solver in the context of plasticity. First, the necessary function spaces are created (lines 3-5). V is used to define the bilinear and linear forms and the displacement field u, while Vt and Vs are used for the two coefficient spaces: the consistent tangent and the stress, which enter the bilinear and linear forms of the plasticity problem. The forms defining the plasticity problem are then created and the relevant functions are attached to the forms (lines 8-12). Then the object defining the plasticity model is created (line 25). The class VonMises is a subclass of the PlasticityModel class shown in Figure 2.3 and it implements functions for $f$, $\partial_{\sigma} f$ and $\partial_{\sigma\sigma} g$. It is constructed with values for the Young's modulus, Poisson's ratio, yield stress and linear hardening parameter. This object can then be passed to the constructor of the PlasticityProblem class along with the forms, displacement field u, coefficient functions and boundary conditions (line 28).
The PlasticityProblem class, a subclass of the DOLFIN class NonlinearProblem, handles the assembly over cells, loops over cell quadrature points, and variable updates in addition to defining the linear and bilinear forms of the plasticity problem. The PlasticityProblem is solved by the NewtonSolver like any other NonlinearProblem object as described earlier in this chapter (line 41). After each Newton solve, the history variables are updated by calling the update_variables function before proceeding with the next solution increment (line 44).

C++ code
class PlasticityModel
{
public:

  /// Constructor
  PlasticityModel(double E, double nu);

  /// Return hardening parameter
  virtual double hardening_parameter(double eps_eq) const;

  /// Equivalent plastic strain
  virtual double kappa(double eps_eq, const arma::vec& stress,
                       double lambda_dot) const;

  /// Value of yield function f
  virtual double f(const arma::vec& stress,
                   double equivalent_plastic_strain) const = 0;

  /// First derivative of f with respect to sigma
  virtual void df(arma::vec& df_dsigma, const arma::vec& stress) const = 0;

  /// First derivative of g with respect to sigma
  virtual void dg(arma::vec& dg_dsigma, const arma::vec& stress) const;

  /// Second derivative of g with respect to sigma
  virtual void ddg(arma::mat& ddg_ddsigma, const arma::vec& stress) const = 0;

};

Figure 2.3: PlasticityModel public interface defined by the plasticity library. Users are required to supply implementations for at least the pure virtual functions. These functions describe the plasticity model.

UFL code
 1  element = VectorElement("Lagrange", tetrahedron, 2)
 2  elementT = VectorElement("Quadrature", tetrahedron, 2, 36)
 3  elementS = VectorElement("Quadrature", tetrahedron, 2, 6)
 4
 5  u, w = TrialFunction(element), TestFunction(element)
 6  b, h = Coefficient(element), Coefficient(element)
 7  t, s = Coefficient(elementT), Coefficient(elementS)
 8
 9  def eps(u):
10      return as_vector([u[i].dx(i) for i in range(3)] \
11                       + [u[i].dx(j) + u[j].dx(i) for i, j in [(0, 1), (0, 2), (1, 2)]])
12
13  def sigma(s):
14      return as_matrix([[s[0], s[3], s[4]],
15                        [s[3], s[1], s[5]],
16                        [s[4], s[5], s[2]]])
17
18  def tangent(t):
19      return as_matrix([[t[i*6 + j] for j in range(6)] for i in range(6)])
20
21  a = inner(dot(tangent(t), eps(u)), eps(w))*dx
22  L = inner(sigma(s), grad(w))*dx - dot(b, w)*dx - dot(h, w)*ds

Figure 2.4: Definition of the linear and bilinear variational forms for plasticity expressed using UFL syntax.
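To make the role of the PlasticityModel interface in Figure 2.3 more concrete, the following is a hypothetical Python analogue of a von Mises model with linear isotropic hardening. It is a sketch only and is not part of the FEniCS Solid Mechanics library, which is C++ only. The class name `VonMisesSketch`, the stress ordering [xx, yy, zz, xy, xz, yz] (taken from `sigma(s)` in Figure 2.4) and the convention that each shear entry represents both symmetric tensor components are assumptions.

```python
import math


class VonMisesSketch:
    """Hypothetical analogue of a PlasticityModel subclass (illustration only).

    Stress vectors are assumed to be ordered [xx, yy, zz, xy, xz, yz].
    """

    def __init__(self, yield_stress, hardening_parameter):
        self.yield_stress = yield_stress
        self.H = hardening_parameter

    def _effective_stress(self, stress):
        # Split into mean stress, deviatoric normal parts and shear parts
        p = (stress[0] + stress[1] + stress[2]) / 3.0
        dev = [stress[0] - p, stress[1] - p, stress[2] - p]
        shear = list(stress[3:6])
        # q = sqrt(3*J2); shear terms count twice (symmetric tensor)
        q = math.sqrt(1.5 * (sum(d * d for d in dev)
                             + 2.0 * sum(t * t for t in shear)))
        return q, dev, shear

    def f(self, stress, equivalent_plastic_strain):
        """Value of the yield function f."""
        q, _, _ = self._effective_stress(stress)
        return q - (self.yield_stress + self.H * equivalent_plastic_strain)

    def df(self, stress):
        """First derivative of f with respect to the stress vector.

        Not defined for a purely hydrostatic stress state (q = 0).
        """
        q, dev, shear = self._effective_stress(stress)
        return [1.5 * d / q for d in dev] + [3.0 * t / q for t in shear]


model = VonMisesSketch(yield_stress=200.0, hardening_parameter=2000.0)
uniaxial = [250.0, 0.0, 0.0, 0.0, 0.0, 0.0]
f_value = model.f(uniaxial, 0.0)
df_value = model.df(uniaxial)
```

For the uniaxial state above, the effective stress equals the axial stress, so `f_value` is the amount by which the trial state exceeds the yield stress.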
2.4.3 Hyperelasticity

The construction of a solver for a hyperelastic problem, phrased as a minimisation problem, is now presented and follows the minimisation framework presented in Section 2.1.3. The compressible neo-Hookean model in (2.34) is adopted. The automatic functional differentiation features of UFL permit the solver code to resemble closely the abstract mathematical presentation. Differentiation of forms with respect to functions is handled by the UFL function derivative. For instance, given the potential energy functional $\Pi(u)$ as a function of the displacements $u$, the derivative of $\Pi$ with respect to $u$ in the direction $w$ is given by

\[
D_{u}\Pi(u)[w] := \left. \frac{d\Pi(u + \epsilon w)}{d\epsilon} \right|_{\epsilon = 0}, \tag{2.59}
\]

which can be implemented in UFL by the expression:

UFL code
derivative(Pi, u, w)

If w is a test function, the result from applying the derivative is a linear form, which can be differentiated again to yield a bilinear form as shown in (2.4). Noteworthy in this approach is that it is not necessary to provide an explicit expression for the stress tensor. Changing model is therefore as simple as redefining the stored energy density function $\Psi_0$.

A complete hyperelastic solver is presented in Figure 2.6. It corresponds to a problem posed on a unit cube, loaded by a body force $b_0 = (0, -0.5, 0)$, and restrained such that $u = (0, 0, 0)$ where $x = 0$. Elsewhere on the boundary the traction $h_0 = (0.1, 0, 0)$ is applied. Continuous, piecewise linear functions for the displacement field are used. The code in Figure 2.6 adopts the same notation used in Sections 2.1.3 and 2.2.3. The problem is posed on the reference domain, and for convenience the subscripts '0' have been dropped in the code. The solver in Figure 2.6 solves the problem using one Newton step. For problems with stronger nonlinearities, perhaps as a result of greater volumetric or surface forcing terms, it may be necessary to apply a pseudo time-stepping approach and solve the problem in a number of Newton increments, or it may be necessary to apply a path following solution method.
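The directional derivative in (2.59) can be checked numerically on a small discrete analogue. The sketch below is not UFL and does not use the `derivative` function. It assumes a one-dimensional bar discretised with piecewise linear displacements and a one-dimensional analogue of the neo-Hookean energy density, and the names `psi`, `Pi`, `dPi` and `dPi_fd` are introduced here for illustration. The analytical directional derivative is compared with a central difference in $\epsilon$.

```python
import math


def psi(F, mu=1.0, lmbda=1.5):
    # One-dimensional analogue of a neo-Hookean energy density (assumed form)
    return 0.5 * mu * (F * F - 1.0) - mu * math.log(F) + 0.5 * lmbda * math.log(F) ** 2


def dpsi(F, mu=1.0, lmbda=1.5):
    # Derivative of psi with respect to the stretch F
    return mu * F - mu / F + lmbda * math.log(F) / F


def Pi(u, h):
    # Discrete energy: sum over cells of (cell size) * psi(stretch)
    return sum(h * psi(1.0 + (u[e + 1] - u[e]) / h) for e in range(len(u) - 1))


def dPi(u, w, h):
    # Analytical directional derivative of Pi at u in the direction w
    return sum(h * dpsi(1.0 + (u[e + 1] - u[e]) / h) * (w[e + 1] - w[e]) / h
               for e in range(len(u) - 1))


def dPi_fd(u, w, h, eps=1e-6):
    # Central difference approximation of d/d(eps) Pi(u + eps*w) at eps = 0
    up = [ui + eps * wi for ui, wi in zip(u, w)]
    um = [ui - eps * wi for ui, wi in zip(u, w)]
    return (Pi(up, h) - Pi(um, h)) / (2.0 * eps)


u = [0.0, 0.02, 0.05, 0.04, 0.1]   # nodal displacements
w = [0.0, 1.0, -0.5, 0.3, 0.2]     # direction
h = 0.25                           # cell size on the unit interval

exact, approx = dPi(u, w, h), dPi_fd(u, w, h)
```

In the automated framework, the hand-written `dpsi` is what becomes unnecessary: UFL derives the corresponding expression from the energy functional.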
C++ code
 1  // Create mesh and define function spaces
 2  UnitCube mesh(4, 4, 4);
 3  Plasticity::FunctionSpace V(mesh);
 4  Plasticity::BilinearForm::CoefficientSpace_t Vt(mesh);
 5  Plasticity::LinearForm::CoefficientSpace_s Vs(mesh);
 6
 7  // Create functions, forms and attach functions
 8  Function u(V); Function tangent(Vt); Function stress(Vs);
 9  Plasticity::BilinearForm a(V, V);
10  Plasticity::LinearForm L(V);
11  a.t = tangent;
12  L.s = stress;
13
14  // Young's modulus and Poisson's ratio
15  double E = 20000.0; double nu = 0.3;
16
17  // Slope of hardening (linear) and hardening parameter
18  double E_t(0.1*E);
19  double hardening_parameter = E_t/(1.0 - E_t/E);
20
21  // Yield stress
22  double yield_stress = 200.0;
23
24  // Object of class von Mises
25  fenicssolid::VonMises J2(E, nu, yield_stress, hardening_parameter);
26
27  // Create PlasticityProblem
28  fenicssolid::PlasticityProblem nonlinear_problem(a, L, u, tangent, stress, bcs, J2);
29
30  // Create nonlinear solver
31  NewtonSolver nonlinear_solver;
32
33  // Pseudo time stepping parameters
34  double t = 0.0; double dt = 0.005; double T = 0.02;
35
36  // Apply load in steps
37  while (t < T)
38  {
39    // Increment time and solve nonlinear problem
40    t += dt;
41    nonlinear_solver.solve(nonlinear_problem, *u.vector());
42
43    // Update variables for next load step
44    nonlinear_problem.update_variables();
45  }

Figure 2.5: DOLFIN code extract for solving a plasticity problem using the FEniCS Solid Mechanics library.
Python code
from dolfin import *

# Optimization options for the form compiler
parameters["form_compiler"]["cpp_optimize"] = True

# Create mesh and define function space
mesh = UnitCube(16, 16, 16)
V = VectorFunctionSpace(mesh, "Lagrange", 1)

# Define Dirichlet boundary (x = 0)
def left(x):
    return x[0] < DOLFIN_EPS
bc = DirichletBC(V, Constant((0.0, 0.0, 0.0)), left)

# Define test and trial functions
du, w = TrialFunction(V), TestFunction(V)

# Displacement from previous iteration
u = Function(V)
b = Constant((0.0, -0.5, 0.0))  # Body force per unit mass
h = Constant((0.1, 0.0, 0.0))   # Traction force on the boundary

# Kinematics
I = Identity(V.cell().d)  # Identity tensor
F = I + grad(u)           # Deformation gradient
C = F.T*F                 # Right Cauchy-Green tensor
Ic, J = tr(C), det(F)     # Invariants of deformation tensors

# Elasticity parameters
E, nu = 10.0, 0.3
mu, lmbda = E/(2*(1 + nu)), E*nu/((1 + nu)*(1 - 2*nu))

# Stored strain energy density (compressible neo-Hookean model)
Psi = (mu/2)*(Ic - 3) - mu*ln(J) + (lmbda/2)*(ln(J))**2

# Total potential energy
Pi = Psi*dx - dot(b, u)*dx - dot(h, u)*ds

# Compute first variation of Pi (directional derivative about u in the direction of w)
F = derivative(Pi, u, w)

# Compute Jacobian of F
dF = derivative(F, u, du)

# Create nonlinear variational problem and solve
problem = NonlinearVariationalProblem(F, u, bcs=bc, J=dF)
solver = NonlinearVariationalSolver(problem)
solver.solve()

Figure 2.6: Complete DOLFIN solver for the compressible neo-Hookean model, formulated as a minimisation problem.
2.4.4 Elastodynamics
As a final example, a linearised elastodynamics problem is considered to illustrate the solution of time-dependent problems. The example is based on the Newmark family of methods, which are widely used in structural dynamics. It is a direct integration method, in which the equations are evaluated at discrete points in time separated by a time increment $\Delta t$. Thus, the time step $t_{n+1}$ is equal to $t_{n} + \Delta t$. While this section addresses the Newmark scheme, it is straightforward to extend the approach (and implementation) to generalised-α methods (Hilber et al., 1977). The Newmark relations between displacements, velocities and accelerations at $t_{n}$ and $t_{n+1}$ read:

\[
u_{n+1} = u_{n} + \Delta t \, \dot{u}_{n} + \frac{1}{2} \Delta t^{2} \left( 2\beta \, \ddot{u}_{n+1} + (1 - 2\beta) \, \ddot{u}_{n} \right), \tag{2.60}
\]

\[
\dot{u}_{n+1} = \dot{u}_{n} + \Delta t \left( \gamma \, \ddot{u}_{n+1} + (1 - \gamma) \, \ddot{u}_{n} \right), \tag{2.61}
\]

where $\beta$ and $\gamma$ are scalar parameters. Various well-known schemes are recovered for particular combinations of $\beta$ and $\gamma$. Setting $\beta = 1/4$ and $\gamma = 1/2$ leads to the trapezoidal scheme, and setting $\beta = 0$ and $\gamma = 1/2$ leads to a central difference scheme. For $\beta > 0$, re-arranging (2.60) and using (2.61) leads to:

\[
\ddot{u}_{n+1} = \frac{1}{\beta \Delta t^{2}} \left( u_{n+1} - u_{n} - \Delta t \, \dot{u}_{n} \right) - \left( \frac{1}{2\beta} - 1 \right) \ddot{u}_{n}, \tag{2.62}
\]

\[
\dot{u}_{n+1} = \frac{\gamma}{\beta \Delta t} \left( u_{n+1} - u_{n} \right) - \left( \frac{\gamma}{\beta} - 1 \right) \dot{u}_{n} - \Delta t \left( \frac{\gamma}{2\beta} - 1 \right) \ddot{u}_{n}, \tag{2.63}
\]

in which $u_{n+1}$ is the only unknown term on the right-hand side. To solve a time dependent problem, the governing equation can be posed at time $t_{n+1}$,

\[
F(u_{n+1}; w) = 0 \quad \forall \, w \in V, \tag{2.64}
\]

with the expressions in (2.62) and (2.63) used for first and second time derivatives of $u$ at time $t_{n+1}$. The viscoelastic model under consideration is a minor extension of the elasticity model in (2.18). For the viscoelastic model, the stress tensor is given by:

\[
\sigma = 2\mu\varepsilon + \left( \lambda \, \mathrm{tr}(\varepsilon) + \eta \, \mathrm{tr}(\dot{\varepsilon}) \right) I, \tag{2.65}
\]

where the constant scalar $\eta \geq 0$ is a viscosity parameter.

A simple, but complete, elastodynamics solver is presented in Figures 2.7 and 2.8. The solver mirrors the notation used in (2.62), (2.63) and (2.65), with expressions for the acceleration, velocity and displacement at time $t_{n}$ (a0, v0, u0), and expressions for the acceleration and velocity at time $t_{n+1}$ (a1, v1) in terms of the displacement at $t_{n+1}$ (u1) and other fields at time $t_{n}$. For simplicity, the source term $b = (0, 0, 0)$. The body is fixed such that $u = (0, 0, 0)$ at $x = 0$ and the initial conditions are $u_0 = v_0 = (0, 0, 0)$. A traction $h$ is applied at $x = 1$ and is increased linearly from zero to one over the first five time steps. Therefore, no forces are acting on the body at $t = 0$ and the initial acceleration is zero. Again, the UFL functions lhs and rhs have been used to extract the bilinear and linear terms from the form. This is particularly convenient for time-dependent problems since it allows the code implementation to be posed in the same format as is usually adopted in the mathematical presentation, with the equation of interest posed in terms of fields at some point between times $t_{n}$ and $t_{n+1}$. The presented solver could be made more efficient by exploiting linearity of the governing equation and thereby re-using the factorisation of the system matrix.
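The use of (2.62) and (2.63) to eliminate the velocity and acceleration at $t_{n+1}$ can be illustrated on a single degree of freedom oscillator, $m\ddot{u} + c\dot{u} + ku = p(t)$. The sketch below is independent of DOLFIN. The function name `newmark_sdof` and the scalar model are assumptions made for illustration, and the initial acceleration is computed from the equation of motion at $t = 0$.

```python
def newmark_sdof(m, c, k, u0, v0, dt, n_steps, beta=0.25, gamma=0.5,
                 load=lambda t: 0.0):
    """Newmark time stepping for m*a + c*v + k*u = load(t) (scalar sketch).

    Returns a list of (t, u, v, a) tuples, including the initial state.
    """
    # Initial acceleration from the equation of motion at t = 0
    a0 = (load(0.0) - c * v0 - k * u0) / m
    t = 0.0
    history = [(t, u0, v0, a0)]

    # Coefficient multiplying u_{n+1} after inserting (2.62) and (2.63)
    k_eff = m / (beta * dt ** 2) + c * gamma / (beta * dt) + k

    for _ in range(n_steps):
        t += dt
        # Known parts: a1 = u1/(beta*dt^2) - a_hat, v1 = gamma*u1/(beta*dt) - v_hat
        a_hat = (u0 + dt * v0) / (beta * dt ** 2) + (1.0 / (2.0 * beta) - 1.0) * a0
        v_hat = (gamma * u0 / (beta * dt) + (gamma / beta - 1.0) * v0
                 + dt * (gamma / (2.0 * beta) - 1.0) * a0)

        # Solve the equation of motion at t_{n+1} for the displacement
        u1 = (load(t) + m * a_hat + c * v_hat) / k_eff
        a1 = u1 / (beta * dt ** 2) - a_hat       # cf. (2.62)
        v1 = gamma * u1 / (beta * dt) - v_hat    # cf. (2.63)

        u0, v0, a0 = u1, v1, a1
        history.append((t, u0, v0, a0))
    return history


# Undamped free vibration with the trapezoidal parameters
m, k = 1.0, 4.0
history = newmark_sdof(m=m, c=0.0, k=k, u0=1.0, v0=0.0, dt=0.05, n_steps=200)
energy = [0.5 * m * v ** 2 + 0.5 * k * u ** 2 for (_, u, v, _) in history]
```

Because `k_eff` does not change between steps, the scalar division plays the role of the system matrix factorisation that could be re-used in the solver of Figures 2.7 and 2.8. For the undamped, unloaded case with $\beta = 1/4$ and $\gamma = 1/2$ the total energy stays constant up to round-off, which gives a simple check of the update formulas.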
2.5 Current and future developments
In this chapter a range of standard solid mechanics problems have been presented in the context of automated modelling. The implementation of the models was shown to be relatively straightforward due to the high level of abstraction provided in the FEniCS framework. The presented cases cover a range of typical solid mechanics problems that can be solved using FEniCS version 1.0. To broaden the range of problems that can be handled in the FEniCS framework the following two extensions are of particular interest from a solid mechanics viewpoint:
Assembly of forms on manifolds: In FEniCS version 1.0, it is assumed that two-dimensional elements, like triangles, are embedded in $\mathbb{R}^2$ and three-dimensional elements, like tetrahedra, are embedded in $\mathbb{R}^3$. At the time of writing, support for two-dimensional elements embedded in $\mathbb{R}^3$ and one-dimensional elements embedded in $\mathbb{R}^2$ or $\mathbb{R}^3$ is being implemented in the development version of FEniCS. This does, among other things, facilitate the development of support for shell and truss problems within the automated framework.
Isoparametric elements: This issue relates to quadrilateral and hexahedral elements, which are currently not supported, and to elements with higher order mappings that allow curved mesh boundaries to be represented.
Finally, to attract more users with a solid mechanics background another extension to consider is improving the interface of the FEniCS Solid Mechanics library to make it more similar to conventional finite element software packages. This involves supplying the users with information like strain, strain rates and possibly gradients of strain at integration point level for the user to formulate the constitutive relation without working with the weak form of the governing equations.
Python code
from dolfin import *

# External load
class Traction(Expression):
    def __init__(self, end):
        Expression.__init__(self)
        self.t = 0.0
        self.end = end

    def eval(self, values, x):
        values[0] = 0.0
        values[1] = 0.0
        if x[0] > 1.0 - DOLFIN_EPS:
            values[0] = self.t/self.end if self.t < self.end else 1.0

    def value_shape(self):
        return (2,)

def update(u, u0, v0, a0, beta, gamma, dt):
    # Get vectors (references)
    u_vec, u0_vec = u.vector(), u0.vector()
    v0_vec, a0_vec = v0.vector(), a0.vector()

    # Update acceleration and velocity
    a_vec = (1.0/(2.0*beta))*((u_vec - u0_vec - v0_vec*dt)/(0.5*dt*dt)
                              - (1.0 - 2.0*beta)*a0_vec)

    # v = dt * ((1-gamma)*a0 + gamma*a) + v0
    v_vec = dt*((1.0 - gamma)*a0_vec + gamma*a_vec) + v0_vec

    # Update (t(n) <-- t(n+1))
    v0.vector()[:], a0.vector()[:] = v_vec, a_vec
    u0.vector()[:] = u.vector()

# Load mesh and define function space
mesh = UnitSquare(32, 32)

# Define function space
V = VectorFunctionSpace(mesh, "Lagrange", 1)

# Test and trial functions
u1, w = TrialFunction(V), TestFunction(V)

E, nu = 10.0, 0.3
mu, lmbda = E/(2.0*(1.0 + nu)), E*nu/((1.0 + nu)*(1.0 - 2.0*nu))

# Mass density and viscous damping coefficient
rho, eta = 1.0, 0.2

Figure 2.7: DOLFIN code for solving for a dynamic problem using an implicit Newmark scheme. Program continues in Figure 2.8.
Python code
# Time stepping parameters
beta, gamma = 0.25, 0.5
dt = 0.1
t, T = 0.0, 20*dt

# Fields from previous time step (displacement, velocity, acceleration)
u0, v0, a0 = Function(V), Function(V), Function(V)
h = Traction(T/4.0)

# Velocity and acceleration at t_(n+1)
v1 = (gamma/(beta*dt))*(u1 - u0) - (gamma/beta - 1.0)*v0 - dt*(gamma/(2.0*beta) - 1.0)*a0
a1 = (1.0/(beta*dt**2))*(u1 - u0 - dt*v0) - (1.0/(2.0*beta) - 1.0)*a0

# Stress tensor
def sigma(u, v):
    return 2.0*mu*sym(grad(u)) + (lmbda*tr(grad(u)) + eta*tr(grad(v)))*Identity(u.cell().d)

# Governing equation
F = (rho*dot(a1, w) + inner(sigma(u1, v1), sym(grad(w))))*dx - dot(h, w)*ds

# Extract bilinear and linear forms
a, L = lhs(F), rhs(F)

# Set up boundary condition at left end
zero = Constant((0.0, 0.0))
def left(x):
    return x[0] < DOLFIN_EPS
bc = DirichletBC(V, zero, left)

# Set up PDE, advance in time and solve
u = Function(V)
problem = LinearVariationalProblem(a, L, u, bcs=bc)
solver = LinearVariationalSolver(problem)

# Save solution in VTK format
file = File("displacement.pvd")
while t <= T:
    t += dt
    h.t = t
    solver.solve()
    update(u, u0, v0, a0, beta, gamma, dt)
    file<
Figure 2.8: Continuation from Figure 2.7 of DOLFIN code extract for solving for a dynamic problem.
3 Representations and optimisations of finite element variational forms
The previous chapter demonstrated that solvers for various solid mechanics problems can be implemented with relatively little effort using an automated modelling approach which relies on the abstractions offered by UFL and the ability of FFC to generate C++ code from the UFL input. For the approach to be competitive with hand written code, it is important that the run-time performance of the corresponding low-level code generated from the UFL representation is comparable to that of hand written code. To this end, FFC implements two different types of representations of finite element tensors, the so-called tensor contraction representation and the classical quadrature-loop representation, including optimisations of both representations.
The development of different strategies for representing and optimising finite element variational forms has been motivated by the desire to apply the automated modelling approach to problems of increasing complexity. The first representation available in FFC was the tensor contraction. However, this representation is not effective for problems like plasticity in Section 2.4.2. This led to the development of a representation based on quadrature which included the optimisations described in Sections 3.3.1 and 3.3.2. With the availability of automatic differentiation in UFL, problems like hyperelasticity could easily be implemented in the automated framework (Sections 2.2.3 and 2.4.3). For these types of problems, further optimisations of the quadrature representation were necessary for efficient computation. These optimisations are important as FFC will automatically select the quadrature representation for moderately complex and highly complex problems if the representation is not set by the user. The automatic selection of representation is discussed in Section 3.6. Many FEniCS users will, therefore, be using the quadrature representation and optimisations unknowingly, particularly if they work through the Python interface of DOLFIN.
This chapter presents the developments in FFC in terms of representations and optimisations for finite element variational forms and is primarily based on the work in Ølgaard and Wells (2009, 2010, 2012b) with the main difference being that code examples and results have been updated to be compliant with FEniCS version 1.0. The developments have been applied by researchers and application developers to various problems such as multiphase flow through porous media (Wells et al., 2008), free surface flows (Labeur and Wells, 2009), the Navier–Stokes equations (Mortensen et al., 2011; Labeur and Wells, 2012; Jansson et al., 2011; Selim et al., 2012), fluid structure interaction (Selim, 2012; Hoffman et al., 2013), shape memory alloys (Grandi et al., 2012), electromagnetics (Marchand and Davidson, 2011; Lezar and Davidson, 2012), magnetic fluid hyperthermia for cancer therapy (Miaskowski et al., 2012), oscillatory hydraulic tomography (Saibaba et al., 2012), the Föppl–Von Kármán shell model (Vidoli, 2013), nonlinear elliptic problems (Lakkis and Pryer, 2011), microstructural processes (Maraldi et al., 2011, 2012), mantle convection simulations (Vynnytska et al., 2013, 2012), glacier ice motion (Riesen et al., 2010; Riesen, 2011), PDE-constrained optimisation and optimal control (Brandenburg et al., 2012; Funke and Farrell, 2013; Rosseel and Wells, 2012; Clason and Kunisch, 2012; Rognes and Logg, 2012), Nitsche's method for overlapping meshes (Massing et al., 2012b,a, 2013), automated modelling of evolving discontinuities (Nikbakht and Wells, 2009; Nikbakht, 2012), liquid crystal elastomers (Luo and Calderer, 2012), and crack propagation in elastomers (Horst et al., 2013).
3.1 Motivation and approach
The tensor contraction representation of element tensors (Kirby and Logg, 2006; Ølgaard et al., 2008a) is based on the multiplicative decomposition of an element tensor into two tensors, one of which depends only on the differential equation and the chosen finite element bases and can be computed prior to run-time. It has been shown for classes of problems that the tensor contraction representation is more efficient than the traditional quadrature approach, and the speed-ups can be dramatic (Kirby and Logg, 2006; Ølgaard and Wells, 2010). Furthermore, strategies which analyse the structure of the tensor contraction representation can yield improved performance (Kirby et al., 2005, 2006). However, in contrast to the quadrature-loop approach, the tensor contraction representation is somewhat specialised as it cannot be extended trivially to non-affine isoparametric mappings while maintaining efficiency, and it is not effective for classes of nonlinear problems which require the integration of functions that do not come from a finite element space (Ølgaard et al., 2008b). The attractive feature of the approach is the run-time performance for classes of problems.
A general experience is that the tensor contraction approach does not scale well when forms become more complicated. This is manifest in three ways: the time required to generate low-level code for a variational form becomes prohibitive or may fail due to memory limitations or limitations of underlying libraries¹; the size of the generated code is such that the compilation of the generated low-level code is prohibitively slow and file size limitations of compilers acting on the low-level code may be exceeded; and the run-time performance deteriorates rapidly relative to a quadrature approach. Complicated forms are by no means exotic. Many common nonlinear equations, when linearised, result in forms which involve numerous function products.
Factors that determine the complexity of a form are the number of coefficient functions, the number of derivatives and the polynomial orders of the finite element basis functions. Approaches to reduce the time required for the code generation phase when using the tensor contraction representation have been developed and implemented in FFC (Kirby and Logg, 2007), although these cannot mitigate the inherently expensive nature of the approach for complicated forms.
Using a quadrature representation for more complicated forms mitigates the problems regarding the time required to generate the code and the file size of the generated code. However, a naive implementation of the quadrature representation can have a serious impact on the run-time performance of the generated code. Fortunately, the automated generation of computer code provides scope for various optimisations to be applied such that optimal or near-optimal run-time performance is maintained also for complex forms. The optimisations that have been developed in this work are discussed in Section 3.3, see also Ølgaard and Wells (2010, 2012b).
To demonstrate the issues pertinent to automated code generation for complicated forms this chapter presents the tensor contraction representation and the quadrature representations, and discusses four optimisation strategies for the latter for run-time performance of the generated code. Adopting the approach in Ølgaard and Wells (2010), the two representations are then compared to each other by considering
1. The run-time performance of the generated code;
2. The size of the generated code; and
3. The speed of the code generation phase.
The relative importance of these points may well shift during a development cycle. During initial development, it is likely that the speed of the code generation phase and the size of the generated code are most important, whereas at the end of the development cycle run-time performance is likely to be the most crucial consideration. However, there is typically a correlation between the three points. After comparing the two representations, the four optimisations for the quadrature representation are compared to each other in terms of run-time performance.
¹ For instance, the implementation of the tensor contraction representation in FFC relies on the Python module NumPy (http://www.numpy.org/) for computations involving n-dimensional arrays. The maximum dimension which is allowed is version specific, but for NumPy version 1.6.2 $n_{\max} = 32$.
It should be noted that the presented representations and optimisation techniques are possible to implement with conventional 'hand' coding. Automation, however, makes the approach generic and allows these strategies, which are simple but tedious to implement by hand, to be applied to an unlimited range of problems. Automated code generation is most appealing when considering complicated variational forms, for which a developer could not reasonably be expected to program the strategies by hand.
3.2 Representation of finite element tensors
The bilinear form for the weighted Laplace operator $-\nabla \cdot (w \nabla u)$, where $u$ is unknown and $w$ is a prescribed coefficient, is chosen as a canonical example to illustrate the two different representations and the optimisations implemented in FFC. The bilinear form for this operator reads

\[
a(u, v) := \int_{\Omega} w \, \nabla u \cdot \nabla v \, dx. \tag{3.1}
\]

The quadrature approach can deal with cases in which not all functions come from a finite element space including nonlinear functions like ln, exp, sin etc., using 'quadrature functions' (see Section 2.3.2) that can be evaluated directly at quadrature points. The tensor representation approach only supports cases in which all functions come from a finite element space (using interpolation if necessary). Therefore, to ensure a proper performance comparison between the representations, it is assumed in this chapter that all functions in a form, including coefficient functions, come from a finite element function space. In the case of (3.1), all functions will come from

\[
V_{h} := \left\{ v \in H^{1}(\Omega) : v|_{T} \in P_{q}(T) \;\; \forall \, T \in \mathcal{T}_{h} \right\}, \tag{3.2}
\]

where $P_{q}(T)$ denotes the space of Lagrange polynomials of degree $q$ on the element $T$ of the standard triangulation of $\Omega$, which is denoted by $\mathcal{T}_{h}$. Letting $\{\phi^{T}_{i}\}$ denote the local finite element basis that spans the discrete function space $V_{h}$ on $T$, the local element tensor for an element $T$ can be computed as

\[
A_{T,i} = \int_{T} w \, \nabla \phi^{T}_{i_1} \cdot \nabla \phi^{T}_{i_2} \, dx, \tag{3.3}
\]

where $i = (i_1, i_2)$ is a multi-index. The UFL input for (3.1) is shown in Figure 3.1 for continuous piecewise linear functions on triangles as a basis for all functions in the form.
UFL code
element = FiniteElement("Lagrange", triangle, 1)

u = TrialFunction(element)
v = TestFunction(element)
w = Coefficient(element)

a = w*inner(grad(u), grad(v))*dx
Figure 3.1: UFL input for the weighted Laplacian form on linear triangular elements.
3.2.1 Quadrature representation
FFC generates an intermediate representation of the UFL input in Figure 3.1 as explained in Section 1.3.3. Assuming a standard affine mapping F : T T from T 0 → a reference element T to a given element T , this intermediate representation 0 ∈ Th reads
N n q AT,i = ∑ ∑ Φα3 (X )wα3 q=1 α3=1 d d q d q ∂Xα1 ∂Φi1 (X ) ∂Xα2 ∂Φi2 (X ) q ∑ ∑ ∑ det FT0 W , (3.4) ∂xβ ∂Xα ∂xβ ∂Xα β=1 α1=1 1 α2=1 2 where a change of variables from the reference coordinates X to the real coordinates x = FT(X) has been used. In the above equation, N denotes the number of integration points, d is the dimension of Ω, n is the number of degrees of freedom for the local basis of w, Φi denotes basis functions on the reference element, det FT0 is the determinant of the Jacobian, and Wq is the quadrature weight at integration point Xq. By default, FFC applies a quadrature scheme that will integrate the variational form exactly. From the intermediate representation in (3.4), code for computing entries of the local element tensor is generated. This code is shown in Figure 3.2. Code generated for the quadrature representation is structured in the following way. First, values of geometric quantities that depend on the current element T, like the components of the inverse of the Jacobian matrix ∂Xα1 /∂xβ and ∂Xα2 /∂xβ, are computed and assigned to the variables like K_01 in the code (this code is not shown as it is not important for understanding the nature of the quadrature representation). Next, values of basis functions and their derivatives at integration q q points on the reference element, like Φα3 (X ) and ∂Φi1 (X )/∂Xα1 are tabulated. Finite element basis functions are computed by FIAT. Basis functions and their derivatives on a reference element are independent of the current element T and 62 Chapter 3. Representations and optimisations of finite element variational forms are, therefore, tabulated at compile-time and stored in the tables Psi_w, Psi_vu_D01 and Psi_vu_D10 in Figure 3.2. After the tabulation of basis function values, the loop over integration points begins. In the example, linear elements are considered, and only one integration point is necessary for exact integration. The loop over integration points has therefore been omitted. 
The first task inside a loop over integration points is to compute the values of coefficients at the current integration point. For the considered problem, this involves computing the value of the coefficient $w$. The code for evaluating F0 in Figure 3.2 is an exact translation of the representation $\sum_{\alpha_3=1}^{n} \Phi_{\alpha_3}(X^q) w_{\alpha_3}$. The last part of the code in Figure 3.2 is the loop over the basis function indices $i_1$ and $i_2$, where the contribution to each entry in the local element tensor, $A_T$, from the current integration point is added. The code presented in Figure 3.2 is the default output of the quadrature representation and is not optimised for run-time performance. Optimisation strategies are discussed in Section 3.3. To generate code using the quadrature representation the FFC command-line option -r quadrature should be used.
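The structure described above (geometry first, then tabulated basis values, then the coefficient, then the loop over basis functions) can be sketched in a few lines of Python. This is a minimal illustration for linear triangles and a single integration point, not FFC output; the function names geometry and tabulate_tensor and the use of plain lists are assumptions made for the example.

```python
# Minimal sketch (not FFC output): quadrature representation of the
# weighted Laplacian on a linear triangle with one integration point.

W1 = 0.5                                      # quadrature weight, reference triangle
PSI_W = [[1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]]   # basis values at the integration point
PSI_D10 = [[-1.0, 1.0, 0.0]]                  # d(Phi_i)/dX_0
PSI_D01 = [[-1.0, 0.0, 1.0]]                  # d(Phi_i)/dX_1


def geometry(vertices):
    """Return the inverse Jacobian K, with K[a][b] = dX_a/dx_b, and |det J|."""
    (x0, y0), (x1, y1), (x2, y2) = vertices
    j00, j01 = x1 - x0, x2 - x0
    j10, j11 = y1 - y0, y2 - y0
    det_j = j00 * j11 - j01 * j10
    K = [[j11 / det_j, -j01 / det_j],
         [-j10 / det_j, j00 / det_j]]
    return K, abs(det_j)


def tabulate_tensor(vertices, w):
    """Element tensor (row-major, 3x3) for nodal coefficient values w."""
    K, det = geometry(vertices)
    A = [0.0] * 9
    for q in range(len(PSI_W)):
        # Coefficient value at the integration point.
        F0 = sum(PSI_W[q][r] * w[r] for r in range(3))
        # Loop over basis functions; deliberately unoptimised.
        for j in range(3):
            for k in range(3):
                gj0 = K[0][0] * PSI_D10[q][j] + K[1][0] * PSI_D01[q][j]
                gk0 = K[0][0] * PSI_D10[q][k] + K[1][0] * PSI_D01[q][k]
                gj1 = K[0][1] * PSI_D10[q][j] + K[1][1] * PSI_D01[q][j]
                gk1 = K[0][1] * PSI_D10[q][k] + K[1][1] * PSI_D01[q][k]
                A[j * 3 + k] += (gj0 * gk0 + gj1 * gk1) * F0 * W1 * det
    return A
```

On the reference triangle with $w = 1$ the sketch returns the familiar stiffness matrix of the Laplacian, which is a convenient sanity check for any of the optimised variants discussed later.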
3.2.2 Tensor contraction representation

An alternative to the run-time quadrature approach presented in the previous section is the tensor contraction representation, which is reviewed here by following the work of Kirby and Logg (2006). Taking equation (3.4) as the point of departure, the tensor contraction representation of the element matrix for the weighted Laplacian is expressed as
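\[
A_{T,i} = \sum_{\alpha_1=1}^{d} \sum_{\alpha_2=1}^{d} \sum_{\alpha_3=1}^{n} \det F_T' \, w_{\alpha_3} \sum_{\beta=1}^{d} \frac{\partial X_{\alpha_1}}{\partial x_\beta} \frac{\partial X_{\alpha_2}}{\partial x_\beta} \int_{T_0} \Phi_{\alpha_3} \frac{\partial \Phi_{i_1}}{\partial X_{\alpha_1}} \frac{\partial \Phi_{i_2}}{\partial X_{\alpha_2}} \, dX. \tag{3.5}
\]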
Noteworthy is that the integral appearing in equation (3.5) is independent of the cell geometry and can, therefore, be evaluated prior to run-time. The remaining terms, with the exception of $w_{\alpha_3}$, depend only on the geometry of the cell. Exploiting this observation, the element tensor $A_{T,i}$ can then be expressed as a tensor contraction,
\[
A_{T,i} = \sum_{\alpha} A^0_{i\alpha} G_T^{\alpha}, \tag{3.6}
\]
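where the tensors $A^0_{i\alpha}$ (the reference tensor) and $G_T^{\alpha}$ (the geometry tensor) are defined as

\[
A^0_{i\alpha} = \int_{T_0} \Phi_{\alpha_3} \frac{\partial \Phi_{i_1}}{\partial X_{\alpha_1}} \frac{\partial \Phi_{i_2}}{\partial X_{\alpha_2}} \, dX, \tag{3.7}
\]

\[
G_T^{\alpha} = \det F_T' \, w_{\alpha_3} \sum_{\beta=1}^{d} \frac{\partial X_{\alpha_1}}{\partial x_\beta} \frac{\partial X_{\alpha_2}}{\partial x_\beta}. \tag{3.8}
\]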
C++ code

    virtual void tabulate_tensor(double* A,
                                 const double * const * w,
                                 const ufc::cell& c) const
    {
      ...
      // Quadrature weight.
      static const double W1 = 0.5;

      // Tabulated basis functions at quadrature points.
      static const double Psi_w[1][3] =
        {{0.33333333333333, 0.33333333333333, 0.33333333333333}};
      static const double Psi_vu_D01[1][3] =
        {{-1.0, 0.0, 1.0}};
      static const double Psi_vu_D10[1][3] =
        {{-1.0, 1.0, 0.0}};

      // Compute coefficient value.
      double F0 = 0.0;
      for (unsigned int r = 0; r < 3; r++)
        F0 += Psi_w[0][r]*w[0][r];

      // Loop basis functions.
      for (unsigned int j = 0; j < 3; j++)
      {
        for (unsigned int k = 0; k < 3; k++)
        {
          A[j*3 + k] +=
            ((K_00*Psi_vu_D10[0][j] + K_10*Psi_vu_D01[0][j])*
             (K_00*Psi_vu_D10[0][k] + K_10*Psi_vu_D01[0][k]) +
             (K_01*Psi_vu_D10[0][j] + K_11*Psi_vu_D01[0][j])*
             (K_01*Psi_vu_D10[0][k] + K_11*Psi_vu_D01[0][k])
            )*F0*W1*det;
        }
      }
    }
Figure 3.2: Part of the generated code for quadrature representation of the bilinear form associated with the weighted Laplacian using linear elements in two dimensions. The variables like K_00 are components of the inverse of the Jacobian matrix and det is the determinant of the Jacobian. The code to compute these variables is not shown. A holds the values of the local element tensor and w contains nodal values of the weighting function w.
During assembly, one may then iterate over all elements of the triangulation and on each element $T$ compute the geometry tensor $G_T^{\alpha}$, compute the tensor contraction (3.6) and then add the resulting element tensor $A_{T,i}$ to the global sparse matrix $A$. A generalisation of the approach to general multilinear variational forms is presented in Kirby and Logg (2007).

The code which FFC will generate from the representation in (3.6) is shown in Figure 3.3. As was the case with the quadrature representation, values of geometric quantities that depend on the current element $T$ are computed first and assigned to variables like K_01 in the code (again, this code is not shown as it is not important for understanding the nature of the tensor contraction representation). Based on these values, the geometry tensor (3.8) is computed and the contraction in (3.6) is performed using the reference tensor from (3.7), which is precomputed during the code generation stage (the literal constants 0.166667). Notice that the contraction to compute entries in $A_{T,i}$ is unrolled, which allows any zero-valued entry of the reference tensor to be detected during the code generation stage and the corresponding code can, therefore, be omitted. For a certain class of simple forms this can lead to a tremendous speed-up when evaluating the element matrices relative to a quadrature approach (Kirby and Logg, 2006).

Inevitably, the tensor contraction approach, due to unrolling the contraction, leads to code which is much less compact compared to the quadrature representation (see Figure 3.2). Furthermore, as the number of functions and derivatives present in the variational form increases, the rank of both the reference tensor and the geometry tensor increases, thereby increasing the complexity of the tensor contraction.
Thus, for complicated forms the size of the generated code may cause problems for the compilers acting on the generated low-level code, and the complexity of the tensor contraction may exceed that of the quadrature representation, leading to poor run-time performance. This influence of the complexity on the performance is investigated in Section 3.4. To generate code using the tensor contraction representation the FFC command-line option -r tensor should be used.
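The split into a precomputed reference tensor (3.7) and a run-time geometry tensor (3.8) can be illustrated with the following sketch for linear triangles. It is not FFC's implementation: the dictionary-based sparse storage and the function names are assumptions chosen for brevity, and the inverse Jacobian K (with K[a][b] $= \partial X_a/\partial x_b$) and det are taken as given inputs. For linear elements the derivatives are constant and $\int_{T_0} \Phi_{\alpha_3}\, dX = 1/6$, which is the literal 0.166667 seen in the generated code.

```python
from itertools import product

# DPHI[i][a] = d(Phi_i)/dX_a for linear basis functions on the reference triangle.
DPHI = [[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]]
INT_PHI = 1.0 / 6.0   # integral of each linear basis function over the reference triangle


def reference_tensor():
    """Nonzero entries of A0[(i1, i2, a1, a2, a3)], computed once, before run-time."""
    A0 = {}
    for i1, i2, a1, a2, a3 in product(range(3), range(3), range(2), range(2), range(3)):
        value = INT_PHI * DPHI[i1][a1] * DPHI[i2][a2]
        if value != 0.0:          # zero entries are dropped, as in the unrolled code
            A0[(i1, i2, a1, a2, a3)] = value
    return A0


def geometry_tensor(K, det, w):
    """G[(a1, a2, a3)] = det * w[a3] * sum_b K[a1][b] * K[a2][b], per element."""
    return {(a1, a2, a3): det * w[a3] * sum(K[a1][b] * K[a2][b] for b in range(2))
            for a1, a2, a3 in product(range(2), range(2), range(3))}


def tabulate_tensor_contraction(K, det, w, A0):
    """Contract the reference tensor with the geometry tensor, equation (3.6)."""
    G = geometry_tensor(K, det, w)
    A = [0.0] * 9
    for (i1, i2, a1, a2, a3), value in A0.items():
        A[i1 * 3 + i2] += value * G[(a1, a2, a3)]
    return A
```

Of the 108 entries of the full reference tensor only 48 are nonzero in this sketch, which agrees with the number of terms in the unrolled contraction for this form.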
3.3 Quadrature optimisations
The automated generation of code provides scope for employing optimisations which may not be practically feasible in hand-generated code. An example of such an approach which is pertinent to the tensor contraction representation involves the analysis of the reference tensor, $A^0_{i\alpha}$, in order to find so-called complexity-reducing relations between subtensors which will minimise the number of floating point operations required to compute the element tensor (Kirby et al., 2005, 2006; Kirby and Logg, 2008). For simple problems, this can lead to a significant reduction in the number of operations required to compute the local element tensor, $A_{T,i}$. However,
C++ code

    virtual void tabulate_tensor(double* A,
                                 const double * const * w,
                                 const ufc::cell& c) const
    {
      ...
      // Compute geometry tensor
      const double G0_0_0_0 = det*(w[0][0]*((K_00*K_00 + K_01*K_01)));
      const double G0_0_0_1 = det*(w[0][0]*((K_00*K_10 + K_01*K_11)));
      const double G0_0_1_0 = det*(w[0][0]*((K_10*K_00 + K_11*K_01)));
      const double G0_0_1_1 = det*(w[0][0]*((K_10*K_10 + K_11*K_11)));
      const double G0_1_0_0 = det*(w[0][1]*((K_00*K_00 + K_01*K_01)));
      const double G0_1_0_1 = det*(w[0][1]*((K_00*K_10 + K_01*K_11)));
      const double G0_1_1_0 = det*(w[0][1]*((K_10*K_00 + K_11*K_01)));
      const double G0_1_1_1 = det*(w[0][1]*((K_10*K_10 + K_11*K_11)));
      const double G0_2_0_0 = det*(w[0][2]*((K_00*K_00 + K_01*K_01)));
      const double G0_2_0_1 = det*(w[0][2]*((K_00*K_10 + K_01*K_11)));
      const double G0_2_1_0 = det*(w[0][2]*((K_10*K_00 + K_11*K_01)));
      const double G0_2_1_1 = det*(w[0][2]*((K_10*K_10 + K_11*K_11)));

      // Compute element tensor
      A[0] =  0.166667*G0_0_0_0 + 0.166667*G0_0_0_1 + 0.166667*G0_0_1_0
            + 0.166667*G0_0_1_1 + 0.166667*G0_1_0_0 + 0.166667*G0_1_0_1
            + 0.166667*G0_1_1_0 + 0.166667*G0_1_1_1 + 0.166667*G0_2_0_0
            + 0.166667*G0_2_0_1 + 0.166667*G0_2_1_0 + 0.166667*G0_2_1_1;
      A[1] = -0.166667*G0_0_0_0 - 0.166667*G0_0_1_0 - 0.166667*G0_1_0_0
            - 0.166667*G0_1_1_0 - 0.166667*G0_2_0_0 - 0.166667*G0_2_1_0;
      A[2] = -0.166667*G0_0_0_1 - 0.166667*G0_0_1_1 - 0.166667*G0_1_0_1
            - 0.166667*G0_1_1_1 - 0.166667*G0_2_0_1 - 0.166667*G0_2_1_1;
      A[3] = -0.166667*G0_0_0_0 - 0.166667*G0_0_0_1 - 0.166667*G0_1_0_0
            - 0.166667*G0_1_0_1 - 0.166667*G0_2_0_0 - 0.166667*G0_2_0_1;
      A[4] =  0.166667*G0_0_0_0 + 0.166667*G0_1_0_0 + 0.166667*G0_2_0_0;
      A[5] =  0.166667*G0_0_0_1 + 0.166667*G0_1_0_1 + 0.166667*G0_2_0_1;
      A[6] = -0.166667*G0_0_1_0 - 0.166667*G0_0_1_1 - 0.166667*G0_1_1_0
            - 0.166667*G0_1_1_1 - 0.166667*G0_2_1_0 - 0.166667*G0_2_1_1;
      A[7] =  0.166667*G0_0_1_0 + 0.166667*G0_1_1_0 + 0.166667*G0_2_1_0;
      A[8] =  0.166667*G0_0_1_1 + 0.166667*G0_1_1_1 + 0.166667*G0_2_1_1;
    }
Figure 3.3: Part of the generated code for tensor contraction representation of the bilinear form associated with the weighted Laplacian using linear elements in two dimensions. The variables like K_00 are components of the inverse of the Jacobian matrix and det is the determinant of the Jacobian. The code to compute these variables is not shown. A holds the values of the local element tensor and w contains nodal values of the weighting function w. Due to space considerations the number of digits of the literal constant 0.166667 has been reduced from fifteen to six.

when dealing with complicated, or even moderately complicated, variational formulations, it is not uncommon to find that sophisticated optimisation strategies are poorly rewarded. Such strategies may scale poorly in the computer time required to perform the optimisations for moderately complex variational forms, and may prove prohibitive in terms of time and memory. Experience indicates that simple optimisations, some of which are described in this section, offer the greatest rewards, even to the extent that the cost of evaluating element tensors becomes negligible relative to other aspects of a computation, such as insertion of entries into a sparse matrix.

This section discusses four automated a priori optimisation strategies that have been developed for the quadrature representation from Section 3.2.1 for improved run-time performance of the generated code: eliminate operations on zeros, simplify expressions, precompute integration point constants and precompute basis constants. The underlying philosophy of the optimisation strategies, which are implemented in FFC, is to manipulate the representation in such a way that the number of operations to compute the local element tensor decreases.
The strategies described in the following sections, with the exception of eliminate operations on zeros, share some features, which can be categorised as:
Loop invariant code motion: This procedure seeks to identify terms that are independent of one or more of the summation indices and to move them outside the loop over those particular indices. For instance, in (3.4) the terms regarding the coefficient $w$, the quadrature weight $W^q$ and the determinant $\det F_T'$ are all independent of the basis function indices $i_1$ and $i_2$ and therefore only need to be computed once for each integration point. A generic discussion of this technique, which is also known as ‘loop hoisting’, can be found in Alfred et al. (1986).
Reuse common terms: Terms that appear multiple times in an expression can be identified, computed once, stored as temporary values and then reused in all occurrences in the expression. This can have a great impact on the operation count since the expression to compute an entry in $A_T$ is located inside loops over the basis function indices as shown in the code for the standard quadrature representation in Figure 3.2.
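The dependency analysis that underlies loop invariant code motion can be sketched as follows. Each term is tagged with the set of loop indices it depends on, and it is assigned to the outermost position at which it can be evaluated. The loop order (q, j, k), the function names and the term list are illustrative assumptions; the sketch does not reproduce FFC's internal classification.

```python
def hoisting_level(depends_on, loop_order=("q", "j", "k")):
    """Return the innermost loop index a term depends on, or None if it is loop-invariant."""
    level = None
    for index in loop_order:
        if index in depends_on:
            level = index
    return level


def partition_terms(terms):
    """Group term names by the loop level at which they must be evaluated."""
    groups = {None: [], "q": [], "j": [], "k": []}
    for name, depends_on in terms:
        groups[hoisting_level(depends_on)].append(name)
    return groups


# Terms of representation (3.4), tagged with hypothetical dependency sets.
terms = [
    ("det", set()),                       # geometry only
    ("K_00", set()),                      # geometry only
    ("W", {"q"}),                         # quadrature weight
    ("F0", {"q"}),                        # coefficient at the integration point
    ("Psi_vu_D10[q][j]", {"q", "j"}),     # test function derivative
    ("Psi_vu_D10[q][k]", {"q", "k"}),     # trial function derivative
]
groups = partition_terms(terms)
```

Terms in groups[None] can be evaluated once per element, terms in groups["q"] once per integration point, and only the remaining terms need to stay inside the loops over the basis function indices.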
The optimisations described in this section take place after the representation stage of the code generation process (see Figure 1.3 on page 13) where any given form is represented as simple loop and algebra instructions. Therefore, the optimisations are general and apply to all forms and elements that can be handled by FFC. While the above optimisations are straightforward for simple forms and elements, their implementation using conventional programming approaches requires manual inspection of the form and the basis. This is often done in specialised codes, but the extension to non-trivial forms is difficult, time consuming and error prone. Furthermore, the optimised code may bear little relation to the mathematical problem at hand. This makes maintenance and re-use of the hand-generated code problematic. To switch on optimisation the command-line option -O should be used in addition to any of the FFC optimisation options presented in the following sections.
3.3.1 Eliminate operations on zeros

Some basis functions, in particular those concerning mixed elements, and derivatives of basis functions may be zero-valued at all integration points for a particular problem. Since these values are tabulated at compile-time, the columns containing nonzero values can be identified. This enables a reduction in the loop dimension for indices concerning these tables, a process which is comparable to dead-code elimination in compiler jargon. However, a consequence of reducing the tables is that a mapping of indices must be created in order to access values correctly. The mapping results in memory not being accessed contiguously at run-time and can lead to a decrease in run-time performance.

In some cases the elimination of operations on zero terms is similar to the strategy that the tensor contraction representation applies when unrolling the code as shown in Figure 3.3. The major difference is that the quadrature representation can only eliminate contributions that are zero for all quadrature points, unlike the tensor contraction representation which can eliminate all zero-valued contributions. The unrolled tensor contraction code is, however, longer, which introduces some drawbacks, such as increased C++ compile-time as discussed previously.

To generate code with this optimisation, the FFC command-line option -f eliminate_zeros should be used. Code for the weighted Laplace equation generated with this option is shown in Figure 3.4. For brevity, only code different from the standard quadrature code in Figure 3.2 has been included. As seen in Figure 3.4, the loop dimension for the loops involving the indices j and k has decreased from three to two due to the elimination of zeros, compared to the standard quadrature code in Figure 3.2. However, the total number of operations has increased.
The reason is that the mapping causes four entries to be computed at the same time inside the loop, and the code to compute each entry has not been reduced significantly compared to the code in Figure 3.2. In fact, using this optimisation strategy by itself is usually not recommended, but in combination with the strategies outlined in the following sections it can improve run-time performance significantly. This effect is particularly pronounced when forms contain mixed elements in which many of the values in the basis function tables are zero. Another reason for being careful when applying this strategy is that it might prevent FFC compilation due to hardware limitations, because the increase in the number of entries that are computed inside the loop will require more memory during the compilation.
C++ code

    // Tabulated basis functions.
    static const double Psi_vu[1][2] = {{-1.0, 1.0}};

    // Arrays of nonzero columns.
    static const unsigned int nzc0[2] = {0, 2};
    static const unsigned int nzc1[2] = {0, 1};

    // Loop basis functions.
    for (unsigned int j = 0; j < 2; j++)
    {
      for (unsigned int k = 0; k < 2; k++)
      {
        A[nzc0[j]*3 + nzc0[k]] +=
          (K_10*Psi_vu[0][j]*K_10*Psi_vu[0][k] +
           K_11*Psi_vu[0][j]*K_11*Psi_vu[0][k])*F0*W1*det;
        A[nzc0[j]*3 + nzc1[k]] +=
          (K_11*Psi_vu[0][j]*K_01*Psi_vu[0][k] +
           K_10*Psi_vu[0][j]*K_00*Psi_vu[0][k])*F0*W1*det;
        A[nzc1[j]*3 + nzc0[k]] +=
          (K_00*Psi_vu[0][j]*K_10*Psi_vu[0][k] +
           K_01*Psi_vu[0][j]*K_11*Psi_vu[0][k])*F0*W1*det;
        A[nzc1[j]*3 + nzc1[k]] +=
          (K_01*Psi_vu[0][j]*K_01*Psi_vu[0][k] +
           K_00*Psi_vu[0][j]*K_00*Psi_vu[0][k])*F0*W1*det;
      }
    }
Figure 3.4: Part of the generated code for the weighted Laplacian using linear elements in two dimensions with optimisation option -f eliminate_zeros. The arrays nzc0 and nzc1 contain the nonzero column indices for the mapping of values. Note how eliminating zeros makes it possible to replace the two tables with derivatives of basis functions Psi_vu_D01 and Psi_vu_D10 from Figure 3.2 with one table (Psi_vu).
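How the nonzero-column arrays and the reduced table might be derived from the tabulated values can be sketched as below. The helper names nonzero_columns and compress, the tolerance, and the function tabulate_tensor_reduced are assumptions made for this illustration; the reduced loop itself mirrors the structure of Figure 3.4, with the inverse Jacobian K (K[a][b] corresponding to K_ab), det and F0 passed in as given values.

```python
def nonzero_columns(table, tol=1e-14):
    """Indices of columns that are nonzero at one or more integration points."""
    return [c for c in range(len(table[0]))
            if any(abs(row[c]) > tol for row in table)]


def compress(table, columns):
    """Keep only the listed columns of a tabulated basis table."""
    return [[row[c] for c in columns] for row in table]


PSI_VU_D01 = [[-1.0, 0.0, 1.0]]
PSI_VU_D10 = [[-1.0, 1.0, 0.0]]

nzc0 = nonzero_columns(PSI_VU_D01)     # [0, 2]
nzc1 = nonzero_columns(PSI_VU_D10)     # [0, 1]
psi_vu = compress(PSI_VU_D01, nzc0)    # identical to compress(PSI_VU_D10, nzc1)


def tabulate_tensor_reduced(K, det, F0, W1=0.5):
    """Reduced loops over two columns, with index maps into the 3x3 tensor."""
    (K_00, K_01), (K_10, K_11) = K
    P = psi_vu[0]
    A = [0.0] * 9
    for j in range(2):
        for k in range(2):
            A[nzc0[j] * 3 + nzc0[k]] += (K_10 * P[j] * K_10 * P[k]
                                         + K_11 * P[j] * K_11 * P[k]) * F0 * W1 * det
            A[nzc0[j] * 3 + nzc1[k]] += (K_11 * P[j] * K_01 * P[k]
                                         + K_10 * P[j] * K_00 * P[k]) * F0 * W1 * det
            A[nzc1[j] * 3 + nzc0[k]] += (K_00 * P[j] * K_10 * P[k]
                                         + K_01 * P[j] * K_11 * P[k]) * F0 * W1 * det
            A[nzc1[j] * 3 + nzc1[k]] += (K_01 * P[j] * K_01 * P[k]
                                         + K_00 * P[j] * K_00 * P[k]) * F0 * W1 * det
    return A
```

The indirect writes through nzc0 and nzc1 are what make the memory access non-contiguous, which is the drawback noted in the text.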
3.3.2 Simplify expressions
The code expressions to evaluate an entry in the local element tensor can become very complex. Since such expressions are typically located inside loops, a reduction in complexity can reduce the total operation count significantly. The approach can be illustrated by the expression $x(y + z) + 2xy$, which after expansion of the first term and grouping common terms reduces to $x(y + z) + 2xy \to xy + xz + 2xy \to 3xy + xz$. As $x$ appears in both products in the sum, a reduction of one operation can be achieved by moving $x$ outside the parentheses, $3xy + xz \to x(3y + z)$. By applying these simplifications, the number of operations has been reduced from five to three, which may seem trivial although it is, in fact, a reduction of 40%. The algorithm developed and implemented in FFC to perform simplifications as described above bears resemblance to the algorithm presented by Hosangadi et al. (2006) and later extended and applied to optimised code generation for finite element assembly by Russell and Kelly (2013). An additional benefit of this strategy is that the expansion of expressions, which takes place before the simplification, will typically allow more terms to be precomputed and hoisted outside loops, as explained in the beginning of this section.

The FFC command-line option -f simplify_expressions should be used to generate code with this optimisation enabled. Code generated by this option for the representation in (3.4) is presented in Figure 3.5, where again only code different from that in Figure 3.2 has been included. The number of operations has decreased compared to the code in Figure 3.2 for the standard quadrature representation. An improvement in run-time performance can therefore be expected. To understand how the optimisations lead to the code in Figure 3.5, consider the terms

\[
\sum_{\beta=1}^{d} \sum_{\alpha_1=1}^{d} \frac{\partial X_{\alpha_1}}{\partial x_\beta} \frac{\partial \Phi_{i_1}(X^q)}{\partial X_{\alpha_1}}
\sum_{\alpha_2=1}^{d} \frac{\partial X_{\alpha_2}}{\partial x_\beta} \frac{\partial \Phi_{i_2}(X^q)}{\partial X_{\alpha_2}} \det F_T' \, W^q, \tag{3.9}
\]

in the representation (3.4) for the weighted Laplace equation.
These terms are transformed by FFC into an expression equivalent to the code
C++ code

    ((K_00*Psi_vu_D10[0][j] + K_10*Psi_vu_D01[0][j])*
     (K_00*Psi_vu_D10[0][k] + K_10*Psi_vu_D01[0][k]) +
     (K_01*Psi_vu_D10[0][j] + K_11*Psi_vu_D01[0][j])*
     (K_01*Psi_vu_D10[0][k] + K_11*Psi_vu_D01[0][k])
    )*W1*det;

which is, apart from a missing F0, identical to the standard quadrature code inside the loops in Figure 3.2. This expression is then expanded into a new expression, a sum of products, equivalent to the code
C++ code

    // Geometry constants.
    double G[3];
    G[0] = W1*det*(K_00*K_00 + K_01*K_01);
    G[1] = W1*det*(K_00*K_10 + K_01*K_11);
    G[2] = W1*det*(K_10*K_10 + K_11*K_11);

    // Integration point constants.
    double I[3];
    I[0] = F0*G[0];
    I[1] = F0*G[1];
    I[2] = F0*G[2];

    // Loop basis functions.
    for (unsigned int j = 0; j < 3; j++)
    {
      for (unsigned int k = 0; k < 3; k++)
      {
        A[j*3 + k] +=
          (Psi_vu_D10[0][j]*Psi_vu_D10[0][k]*I[0] +
           Psi_vu_D10[0][j]*Psi_vu_D01[0][k]*I[1] +
           Psi_vu_D01[0][j]*Psi_vu_D10[0][k]*I[1] +
           Psi_vu_D01[0][j]*Psi_vu_D01[0][k]*I[2]);
      }
    }
Figure 3.5: Part of the generated code for the weighted Laplacian using linear elements in two dimensions with optimisation option -f simplify_expressions.
C++ code

    K_00*K_00*W1*det*Psi_vu_D10[0][j]*Psi_vu_D10[0][k] +
    K_00*K_10*W1*det*Psi_vu_D10[0][j]*Psi_vu_D01[0][k] +
    K_00*K_10*W1*det*Psi_vu_D01[0][j]*Psi_vu_D10[0][k] +
    K_10*K_10*W1*det*Psi_vu_D01[0][j]*Psi_vu_D01[0][k] +
    K_01*K_01*W1*det*Psi_vu_D10[0][j]*Psi_vu_D10[0][k] +
    K_01*K_11*W1*det*Psi_vu_D10[0][j]*Psi_vu_D01[0][k] +
    K_01*K_11*W1*det*Psi_vu_D01[0][j]*Psi_vu_D10[0][k] +
    K_11*K_11*W1*det*Psi_vu_D01[0][j]*Psi_vu_D01[0][k];
In the next step of the optimisation process, identical terms depending on the loop indices j and k are identified and grouped such that the expression is equivalent to
C++ code

    (K_00*K_00*W1*det + K_01*K_01*W1*det)*Psi_vu_D10[0][j]*Psi_vu_D10[0][k] +
    (K_00*K_10*W1*det + K_01*K_11*W1*det)*Psi_vu_D10[0][j]*Psi_vu_D01[0][k] +
    (K_00*K_10*W1*det + K_01*K_11*W1*det)*Psi_vu_D01[0][j]*Psi_vu_D10[0][k] +
    (K_10*K_10*W1*det + K_11*K_11*W1*det)*Psi_vu_D01[0][j]*Psi_vu_D01[0][k];

where the terms in parentheses only depend on geometry information. The terms in parentheses can, therefore, be moved outside of the loops over the basis function indices j and k and stored in the array G. During the process of generating values for G, FFC will discover that two of the four parentheses are identical and thus only three unique values in G are computed. The expressions to compute the values in G have been simplified further by moving the variables det and W1, which appear in both products, outside the parentheses as seen in Figure 3.5. The weighting coefficient F0 (left out of the detailed explanation above) will generally depend on the integration point. Therefore, each value in G is multiplied by F0 and the result is stored in the array I, which contains values that are constant inside the loop over integration points.

The optimisation described above is the most expensive of the quadrature optimisations to perform in terms of FFC code generation time and memory consumption, as it involves creating new terms when expanding the expressions. The procedure does not scale well for complex expressions, but it is in many cases the most effective approach in terms of reducing the number of operations. This particular optimisation strategy, in combination with the elimination of zeros outlined in the previous section, was the first to be implemented in FFC. It has been investigated and compared to the tensor representation in Ølgaard and Wells (2010).
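The expand-then-group procedure can be mimicked with a very small symbolic representation in which an expression is a dictionary mapping a sorted tuple of factor names to a numeric coefficient. This is only a sketch under that assumed data structure (the helpers sym, add, mul and group are hypothetical) and is far simpler than the algorithm implemented in FFC, but it reproduces the counts quoted above: eight products after expansion, four groups, and three unique geometry coefficients.

```python
from collections import defaultdict


def sym(name):
    """A single symbol as a sum of products: {(factors...): coefficient}."""
    return {(name,): 1.0}


def add(p, q):
    result = defaultdict(float)
    for poly in (p, q):
        for factors, coeff in poly.items():
            result[factors] += coeff
    return dict(result)


def mul(p, q):
    """Expand the product of two sums into one sum of products."""
    result = defaultdict(float)
    for fp, cp in p.items():
        for fq, cq in q.items():
            result[tuple(sorted(fp + fq))] += cp * cq
    return dict(result)


def group(poly, in_loop):
    """For each product of loop-dependent factors, collect its loop-invariant coefficient."""
    groups = defaultdict(dict)
    for factors, coeff in poly.items():
        inner = tuple(f for f in factors if in_loop(f))
        outer = tuple(f for f in factors if not in_loop(f))
        groups[inner][outer] = groups[inner].get(outer, 0.0) + coeff
    return dict(groups)


# x(y + z) + 2xy  ->  3xy + xz
x, y, z = sym("x"), sym("y"), sym("z")
simple = add(mul(x, add(y, z)), mul({(): 2.0}, mul(x, y)))

# The terms (3.9) for linear triangles.
K00, K01, K10, K11 = (sym(n) for n in ("K_00", "K_01", "K_10", "K_11"))
Dj10, Dj01 = sym("Psi_vu_D10[0][j]"), sym("Psi_vu_D01[0][j]")
Dk10, Dk01 = sym("Psi_vu_D10[0][k]"), sym("Psi_vu_D01[0][k]")


def grad(Ka, Kb, D10, D01):
    return add(mul(Ka, D10), mul(Kb, D01))


bracket = add(mul(grad(K00, K10, Dj10, Dj01), grad(K00, K10, Dk10, Dk01)),
              mul(grad(K01, K11, Dj10, Dj01), grad(K01, K11, Dk10, Dk01)))
expanded = mul(mul(bracket, sym("W1")), sym("det"))
groups = group(expanded, lambda f: f.startswith("Psi"))
unique_geometry = {frozenset(g.items()) for g in groups.values()}
```

The number of intermediate products grows quickly with the number of factors being expanded, which is consistent with the remark that this strategy is the most expensive one in terms of code generation time and memory.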
3.3.3 Precompute integration point constants

The optimisations described in the previous section are performed at the expense of increased code generation time. In order to reduce the generation time while
C++ code

    // Geometry constants.
    double G[1];
    G[0] = W1*det;

    // Integration point constants.
    double I[1];
    I[0] = F0*G[0];

    // Loop basis functions.
    for (unsigned int j = 0; j < 3; j++)
    {
      for (unsigned int k = 0; k < 3; k++)
      {
        A[j*3 + k] +=
          ((Psi_vu_D01[0][j]*K_10 + Psi_vu_D10[0][j]*K_00)*
           (Psi_vu_D01[0][k]*K_10 + Psi_vu_D10[0][k]*K_00) +
           (Psi_vu_D01[0][j]*K_11 + Psi_vu_D10[0][j]*K_01)*
           (Psi_vu_D01[0][k]*K_11 + Psi_vu_D10[0][k]*K_01)
          )*I[0];
      }
    }
Figure 3.6: Part of the generated code for the weighted Laplacian using linear elements in two dimensions with optimisation option -f precompute_ip_const.

achieving a reduction in the operation count, another approach can be taken involving hoisting expressions that are constant with respect to integration points without expanding the expression first. To generate code with this optimisation the FFC command-line option -f precompute_ip_const should be used. Code generated by this method for the representation in (3.4) can be seen in Figure 3.6, which includes only code different from that in Figure 3.2. It is clear from the generated code that this strategy will not lead to a significant reduction in the number of operations for this particular form. The only difference between the code inside the loop in Figure 3.2 and Figure 3.6 is that F0*W1*det has been reduced to I[0], which reduces the number of operations by sixteen (two operations for each of the nine times the loop is executed, minus the two operations to compute the I[0] entry). However, for more complex forms, with many coefficients, the number of terms that can be hoisted will increase significantly, leading to improved run-time performance.
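The saving of sixteen operations can be checked by counting floating point operations directly. The sketch below wraps numbers in a small class, Counted, that increments a counter on every addition and multiplication; the class, the helper names and the identity geometry used as input are assumptions for the illustration only.

```python
class Counted:
    """Float wrapper that counts additions and multiplications."""
    ops = 0

    def __init__(self, value):
        self.value = float(value)

    def __add__(self, other):
        Counted.ops += 1
        return Counted(self.value + other.value)

    def __mul__(self, other):
        Counted.ops += 1
        return Counted(self.value * other.value)


def counted(values):
    return [Counted(v) for v in values]


D10 = counted([-1.0, 1.0, 0.0])
D01 = counted([-1.0, 0.0, 1.0])


def bracket(j, k, K):
    """The geometry/basis part of the integrand, as in Figure 3.2."""
    K_00, K_01, K_10, K_11 = K
    return ((K_00 * D10[j] + K_10 * D01[j]) * (K_00 * D10[k] + K_10 * D01[k])
            + (K_01 * D10[j] + K_11 * D01[j]) * (K_01 * D10[k] + K_11 * D01[k]))


def count_ops(hoist):
    """Return (operation count, element tensor) with or without hoisting F0*W1*det."""
    K = counted([1.0, 0.0, 0.0, 1.0])
    F0, W1, det = Counted(1.0), Counted(0.5), Counted(1.0)
    A = counted([0.0] * 9)
    Counted.ops = 0
    if hoist:
        I0 = F0 * (W1 * det)          # two operations, outside the loops
    for j in range(3):
        for k in range(3):
            if hoist:
                A[j * 3 + k] = A[j * 3 + k] + bracket(j, k, K) * I0
            else:
                A[j * 3 + k] = A[j * 3 + k] + bracket(j, k, K) * F0 * W1 * det
    return Counted.ops, [a.value for a in A]
```

Calling count_ops(False) and count_ops(True) gives the same element tensor, and the difference between the two operation counts is $9 \times 2 - 2 = 16$.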
3.3.4 Precompute basis constants
This optimisation strategy is an extension of the strategy described in the previous section. In addition to hoisting terms related to the geometry and the integration points, values that depend on the basis indices are precomputed inside the loops. This will result in a reduction in operations for cases in which some terms appear frequently inside the loop such that a given value can be reused once computed. To generate code with this optimisation, the FFC command-line option -f precompute_basis_const should be used. Code generated by this method for the representation in (3.4) can be seen in Figure 3.7, where only code that differs from that in Figure 3.6 has been included. Inside the loop, the value of each binary operation is stored in the array B such that it can be reused in subsequent computations.

The UFL representation of (3.4), which is the input to FFC, can be viewed as a directed acyclic graph (DAG). When FFC generates code from this input, it uses algorithms from UFL to traverse the DAG such that code to evaluate subexpressions is generated before code to evaluate any expression which depends on these subexpressions. This ensures that values in B are computed in the correct order. In this particular case, no additional reduction in operations has been achieved, compared to the previous method, since no terms can be reused inside the loop over the indices j and k. However, as the complexity of forms increases so does the scope for reusing terms inside the loop, leading to improved run-time performance.
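The ordering requirement described above, that subexpressions are emitted before the expressions that depend on them, is what a post-order traversal provides. The following sketch uses nested tuples (operator, left, right) as a stand-in for the expression graph and memoises each subexpression so that a repeated term is assigned a single temporary. It is an illustration only; it does not use UFL's traversal algorithms, and the function name generate_temporaries is hypothetical.

```python
def generate_temporaries(expr):
    """Emit 'B[i] = lhs op rhs;' lines in post-order, reusing repeated subexpressions."""
    lines = []
    memo = {}

    def visit(node):
        if isinstance(node, str):          # leaf: a variable or table entry
            return node
        if node in memo:                   # already computed: reuse the temporary
            return memo[node]
        op, left, right = node
        lhs = visit(left)                  # operands first ...
        rhs = visit(right)
        name = "B[%d]" % len(lines)        # ... then the operation itself
        lines.append("%s = %s %s %s;" % (name, lhs, op, rhs))
        memo[node] = name
        return name

    return lines, visit(expr)


def grad(index, Ka, Kb):
    return ("+", ("*", "Psi_vu_D01[0][%s]" % index, Ka),
                 ("*", "Psi_vu_D10[0][%s]" % index, Kb))


# Integrand of the weighted Laplacian with the hoisted constant I[0].
integrand = ("*",
             ("+",
              ("*", grad("j", "K_10", "K_00"), grad("k", "K_10", "K_00")),
              ("*", grad("j", "K_11", "K_01"), grad("k", "K_11", "K_01"))),
             "I[0]")
lines, result = generate_temporaries(integrand)

# A small expression with a repeated term: (a*b) + (a*b).
reuse_lines, _ = generate_temporaries(("+", ("*", "a", "b"), ("*", "a", "b")))
```

For the weighted Laplacian the sketch emits sixteen temporaries, one per binary operation, because nothing repeats inside the loop; for the second expression the product a*b is computed once and used twice.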
3.3.5 Further optimisations

Preliminary investigations suggest that the performance of the quadrature representation can be improved by applying two additional optimisations. Looking at the code in Figure 3.7, it is seen that about half of the temporary values in the array B only depend on the loop index j, and they can therefore be hoisted, as has been done for other terms in previous sections. Another approach is to unroll the loops with respect to j and k in the generated code. This will lead to a dramatic increase in the number of values that can be reused, and the approach can be readily combined with all of the other optimisation strategies. However, the total number of temporary values will also increase. Therefore, this optimisation strategy might not be feasible for all forms.

FFC implements a few efficient quadrature schemes for integrating polynomials of degree less than or equal to six on simplices. For polynomials of degree higher than six, it calls FIAT to compute the quadrature scheme. FIAT supplies schemes that are based on the Gauss–Legendre–Jacobi rule mapped onto simplices (see Karniadakis and Sherwin (2005) for details of such schemes). This means that for integrating a seventh-order polynomial, FFC will use four quadrature points in each spatial direction, that is, $4^3 = 64$ points per cell in three dimensions. A further optimisation of the quadrature representation can thus be achieved by implementing more efficient quadrature schemes for higher order polynomials on simplices, since a reduction in the number of integration points will yield improved run-time performance. FFC does, however, provide an option for a user to specify
C++ code

    for (unsigned int j = 0; j < 3; j++)
    {
      for (unsigned int k = 0; k < 3; k++)
      {
        double B[16];
        B[0] = Psi_vu_D01[0][j]*K_10;
        B[1] = Psi_vu_D10[0][j]*K_00;
        B[2] = (B[0] + B[1]);
        B[3] = Psi_vu_D01[0][k]*K_10;
        B[4] = Psi_vu_D10[0][k]*K_00;
        B[5] = (B[3] + B[4]);
        B[6] = B[2]*B[5];
        B[7] = Psi_vu_D01[0][j]*K_11;
        B[8] = Psi_vu_D10[0][j]*K_01;
        B[9] = (B[7] + B[8]);
        B[10] = Psi_vu_D01[0][k]*K_11;
        B[11] = Psi_vu_D10[0][k]*K_01;
        B[12] = (B[10] + B[11]);
        B[13] = B[12]*B[9];
        B[14] = (B[13] + B[6]);
        B[15] = B[14]*I[0];
        A[j*3 + k] += B[15];
      }
    }
Figure 3.7: Part of the generated code for the weighted Laplacian using linear elements in two dimensions with optimisation option -f precompute_basis_const. The array B contains precomputed values that depend on indices j and k.
the quadrature degree of a variational form thereby permitting inexact quadrature. For instance, to set the quadrature degree equal to two, the command-line option -f quadrature_degree=2 should be used in which case FFC will use a quadrature rule which is able to integrate a quadratic polynomial exactly. For tetrahedra, this will result in a four point quadrature scheme.
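The point counts quoted in this section can be reproduced with a short calculation. The sketch assumes that a Gauss-type rule with $m$ points per direction integrates polynomials up to degree $2m - 1$ exactly, which is consistent with the four points per direction stated above for degree seven; the function name gauss_jacobi_points is hypothetical and the formula is not taken from FIAT's source.

```python
def gauss_jacobi_points(degree, dim):
    """Total points of a tensor-product Gauss-type rule mapped onto a simplex.

    Assumes m points per direction are exact up to degree 2*m - 1, so the
    smallest admissible m for a given degree is (degree + 2) // 2.
    """
    points_per_direction = (degree + 2) // 2
    return points_per_direction ** dim


# Degree seven on a tetrahedron: 4 points per direction, 64 in total.
n_degree7_tet = gauss_jacobi_points(7, 3)

# Degree two on a tetrahedron: 2 points per direction, 8 in total,
# compared with the four point scheme mentioned in the text.
n_degree2_tet = gauss_jacobi_points(2, 3)
```

The comparison for degree two, eight points against four, illustrates why dedicated simplex schemes are attractive: the cost of the quadrature representation scales with the number of integration points.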
3.4 Performance comparisons of representations
Generated tensor contraction and quadrature-based code is now compared in terms of the metrics outlined in Section 3.1, namely the run-time performance, the size of generated code and the speed of the code generation phase. The aim is to elucidate features of the two representations for various problems with the goal of finding a guiding principle for selecting the most appropriate representation for a given problem. First, some typical forms of differing complexity and nature are considered to illustrate some trends and differences between the representations. This leads to a systematic comparison using some very simple forms for which the tensor contraction representation is expected to prove superior, before increasing the complexity of the forms in order to investigate the cross-over point at which the quadrature representation becomes the better representation in terms of run-time performance. Exact quadrature is used for all examples.

All tests were performed on an Intel Core i7-2600 CPU at 3.40 GHz (8 cores, although tests were run in serial) with 15.7 GB of RAM running Ubuntu 12.10 with Linux kernel 3.5.0-23. Python version 2.7.3 and NumPy version 1.6.2 (both pertinent to FFC) are used when generating code, while g++ version 4.7.2 with the ‘-O2 -funroll-loops’ optimisation flags is used to compile the generated C++ code, which is compliant with UFC version 2.0.5. DOLFIN version 1.0.0 is used to assemble the global sparse matrix for tests which involve compressed sparse matrices. DOLFIN provides various linear algebra backends, and PETSc (Balay et al., 2001) is used as the backend for the assembly tests. The nonzero structure of the compressed sparse matrix is initialised and no special reordering of degrees of freedom has been used in the assembly tests. Results presented in this section are obtained with FFC version 1.0.0 using the optimisation options -f eliminate_zeros and -f simplify for the quadrature representation.
3.4.1 Performance for a selection of forms

The two representations are now compared for three different ‘real’ forms to demonstrate the strengths and weaknesses. The first form considered is a mixed Poisson formulation using fifth-order Brezzi–Douglas–Marini (BDM) elements (Brezzi et al., 1985), automation aspects of which have been addressed by Rognes et al. (2010). The bilinear form, which leads to the finite element stiffness matrix,
UFL code

    BDM = FiniteElement("Brezzi-Douglas-Marini", triangle, 5)
    DG = FiniteElement("Discontinuous Lagrange", triangle, 5 - 1)

    mixed_element = BDM*DG

    (sigma, u) = TrialFunctions(mixed_element)
    (tau, w) = TestFunctions(mixed_element)

    a = (dot(sigma, tau) - u*div(tau) + div(sigma)*w)*dx
Figure 3.8: UFL code for the stiffness matrix of the mixed Poisson problem in (3.10) using BDM elements of order five.

for the mixed Poisson problem reads

\[
a(\sigma, u; \tau, w) := \int_{\Omega} \sigma \cdot \tau - u \, (\nabla \cdot \tau) + (\nabla \cdot \sigma) \, w \, dx, \tag{3.10}
\]

where $\tau, \sigma \in V$, $w, u \in W$ and

\[
V := \left\{ \tau \in H(\mathrm{div}, \Omega) : \tau|_T \in BDM_k(T) \;\; \forall\, T \in \mathcal{T}_h \right\}, \tag{3.11}
\]

\[
W := \left\{ w \in L^2(\Omega) : w|_T \in P_{k-1}(T) \;\; \forall\, T \in \mathcal{T}_h \right\}. \tag{3.12}
\]

The UFL code for this form with $k = 5$ is shown in Figure 3.8.

The generation of code for a discontinuous Galerkin formulation of the biharmonic equation with Lagrange basis functions, which involves both cell and interior facet integrals (Ølgaard et al., 2008a), is also considered. The bilinear form for this problem reads
\[
a(u, v) := \int_{\Omega} \nabla^2 u \, \nabla^2 v \, dx
- \int_{\Gamma_0} \langle \nabla^2 u \rangle \cdot [\![ \nabla v ]\!] \, ds
- \int_{\Gamma_0} [\![ \nabla u ]\!] \cdot \langle \nabla^2 v \rangle \, ds
+ \int_{\Gamma_0} \frac{\alpha}{h} [\![ \nabla u ]\!] \cdot [\![ \nabla v ]\!] \, ds, \tag{3.13}
\]

where the functions $u, v \in V$ and

\[
V := \left\{ v \in H^1_0(\Omega) : v|_T \in P_k(T) \;\; \forall\, T \in \mathcal{T}_h \right\}, \tag{3.14}
\]

and $\Gamma_0$ denotes the set of interior facets, $\alpha > 0$ is a penalty parameter and $h$ is a measure of the cell size. See Section 4.2.4 for more details. The UFL code for this bilinear form for the case $k = 3$ is shown in Figure 3.9. The third example is a complicated form which has arisen in modelling temperature-dependent
UFL code element= FiniteElement("Lagrange", triangle, 3) u= TrialFunction(element) v= TestFunction(element)
n= VectorConstant(element.cell()) h= Constant(element.cell()) h_avg= 0.5 *(h(’+’)+ h(’-’))
alpha= 10.0
a= inner(div(grad(u)), div(grad(v))) *dx \ - inner(avg(div(grad(u))), jump(grad(v), n))*dS \ - inner(jump(grad(u), n), avg(div(grad(v))))*dS \ + alpha*h_avg*inner(jump(grad(u), n), jump(grad(v),n))*dS
Figure 3.9: UFL code for the stiffness matrix of a discontinuous Galerkin for- mulation for the biharmonic equation using two-dimensional elements of order three (3.13). multiphase flow through porous media (Wells et al., 2008). It comes from the approximate linearisation of a stabilised finite element formulation for a particular problem and is characterised by standard Lagrange basis functions of low order but the products of many functions from a number of different spaces. The physical significance of the equation is unimportant in the context of this work, therefore it is presented in an abstract form. The bilinear form reads:
\[
\begin{aligned}
a(p, q) := \int_{\Omega} \; & f_0 g_2 g_3 g_4 \, p q
- (1 - g_5) \left( \sum_{i=0}^{2} g_i u_i \right) \cdot \nabla p \, q \\
& - g_6 (1 - g_5) \left( \sum_{i=0}^{2} f_{2i+1} \right) \nabla p \cdot \nabla q
+ f_0 g_2 g_3 g_4 \, p \, g_7 \left( \sum_{i=0}^{2} g_i u_i \right) \cdot \nabla q \\
& - (1 - g_5) \left( \sum_{i=0}^{2} g_i u_i \right) \cdot \nabla p \; g_7 \left( \sum_{i=0}^{2} g_i u_i \right) \cdot \nabla q \\
& - g_6 (1 - g_5) \left( \sum_{i=0}^{2} f_{2i+1} \right) \nabla^2 p \; g_7 \left( \sum_{i=0}^{2} g_i u_i \right) \cdot \nabla q \, \mathrm{d}x,
\end{aligned} \tag{3.15}
\]

where the test and trial functions $q, p \in V$ with

\[
V := \left\{ v \in H^1(\Omega) : v|_T \in P_2(T) \ \forall \, T \in \mathcal{T}_h \right\}, \tag{3.16}
\]

and the functions $f_i \in V_f$, $g_i \in V_g$ and $u_i \in V_u$ are coefficient functions.

UFL code
scalar_p = FiniteElement("Lagrange", triangle, 2)
scalar = FiniteElement("Lagrange", triangle, 1)
dscalar = FiniteElement("Discontinuous Lagrange", triangle, 0)
vector = VectorElement("Discontinuous Lagrange", triangle, 1)
p = TrialFunction(scalar_p)
q = TestFunction(scalar_p)
f0, f1, f2, f3, f4, f5, f6 = [Coefficient(scalar) for i in range(7)]
g0, g1, g2, g3, g4, g5, g6, g7 = [Coefficient(dscalar) for i in range(8)]
u0, u1, u2 = [Coefficient(vector) for i in range(3)]
Sgu = g0*u0 + g1*u1 + g2*u2
S = g6*(1 - g5)*(f1 + f3 + f5)
a_0 = p*g3*f0*g2*g4*q \
    - (1 - g5)*inner(Sgu, grad(p))*q \
    - S*inner(grad(p), grad(q))
a_1 = g3*f0*g2*g4*p*g7*inner(Sgu, grad(q)) \
    - (1 - g5)*inner(Sgu, grad(p))*g7*inner(Sgu, grad(q)) \
    + S*div(grad(p))*g7*inner(Sgu, grad(q))
a = (a_0 + a_1)*dx

Figure 3.10: UFL code for the ‘pressure equation’ (3.15) in two dimensions.

The coefficient spaces are:

\[
V_f := \left\{ f \in H^1(\Omega) : f|_T \in P_1(T) \ \forall \, T \in \mathcal{T}_h \right\}, \tag{3.17}
\]
\[
V_g := \left\{ g \in L^2(\Omega) : g|_T \in P_0(T) \ \forall \, T \in \mathcal{T}_h \right\}, \tag{3.18}
\]
\[
V_u := \left\{ u \in \left[ L^2(\Omega) \right]^2 : u|_T \in \left[ P_1(T) \right]^2 \ \forall \, T \in \mathcal{T}_h \right\}. \tag{3.19}
\]
The coefficient functions are either prescribed or come from the solution of other equations. The UFL input to the compiler for this form is shown in Figure 3.10. Due to the origins of this form, it will informally be denoted as the ‘pressure equation’.

The three forms have been compiled with FFC using the tensor contraction and quadrature representations. In Table 3.1, the time required to generate the code, the size of the generated code and the time required to compile the C++ code are reported for each form. Results are presented for the tensor contraction case, together with the ratio of the quadrature value to the tensor contraction value, denoted by q/t. In measuring the C++ compile-time and the run-time performance, the generated code has been compiled against the library DOLFIN.

Form                generation [s]    q/t    size [kB]    q/t    C++ [s]    q/t
mixed Poisson                  6.3   0.79         4300   0.91       27.2   0.11
DG biharmonic                 23.4   0.04         4800   0.07       77.1   0.06
pressure equation              4.0   0.14         5300   0.05      356.0   0.01

Table 3.1: Timings and code size for the compilation phase for the various variational forms. ‘generation’ is the time required by FFC to generate the tensor contraction code; ‘size’ is the size of the generated tensor contraction code; and ‘C++’ is the time required to compile the generated C++ code. The ratio q/t is the ratio between quadrature and tensor contraction representations.

Noteworthy from the results in Table 3.1 is that the generation phase for the quadrature representation is faster than the tensor contraction representation generation phase for all forms. In all cases the size of the generated quadrature code is smaller than the tensor contraction code, which is reflected in the C++ compile-time. The differences in the C++ compile-time are substantial for all forms (a factor of roughly one hundred for the pressure equation), which is important during the code development phase with frequent recompilations.²

² It should be noted that the C++ compile-time reduces substantially for the tensor contraction representation if no g++ optimisations are used (by approximately a factor of ten). The C++ compile-time for the quadrature representation is typically a couple of seconds irrespective of which g++ optimisation option is used.

Timings and operation counts for the three forms are presented in Table 3.2. The number of floating point operations (flops) is defined as the sum of all ‘+’ and ‘*’ operators in the code for computing the element matrix. Although multiplications are generally more expensive than additions, this definition provides a good measure for the performance of the generated code. The compound operator ‘+=’ is counted as one operation. For the run-time performance, the time required to compute the local element tensors N times is recorded. The time needed to insert the local tensor into the global sparse matrix is not included. For the mixed Poisson problem $N = 5 \times 10^5$, and for the discontinuous Galerkin biharmonic problem and the pressure equation $N = 1 \times 10^6$. Table 3.2 presents the timings and operation counts for the tensor contraction representation, together with the ratio of the quadrature representation case to the tensor contraction representation case, q/t.

Form                   flops     q/t    run-time [s]      q/t
mixed Poisson          38138   34.26            11.7   17.600
DG biharmonic          37353    1.41            15.3    1.175
pressure equation     271356    0.04           158.8    0.014

Table 3.2: Run-time performance for the various variational forms.

The run-time results highlight a key difference between the two representations: the relative performance can differ significantly depending on the nature of the differential equation. For the mixed Poisson problem, the tensor contraction representation is close to a factor of twenty faster than the quadrature representation, whereas for the pressure equation the quadrature representation is close to a factor of seventy faster than the tensor contraction case. Furthermore, the run-time performance ratio and the flops ratio are of the same order of magnitude, suggesting a correlation between the two.

This observation of dramatic differences in run-time performance suggests the possibility of devising a strategy for determining the best representation, without generating the code for each case. Such concepts have been successfully developed in digital signal processing (Püschel et al., 2005). For forms with a relatively simple structure, devising such a scheme is straightforward. However, it turns out to be non-trivial for arbitrary forms.

UFL code
element = FiniteElement("Lagrange", tetrahedron, 2)
u = TrialFunction(element)
v = TestFunction(element)
a = u*v*dx

Figure 3.11: UFL code for the mass matrix in three dimensions with element order q = 2.
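The flop definition used for Table 3.2 (every ‘+’ and ‘*’ counts once, and ‘+=’ counts as a single operation) can be illustrated with a small counting script. This is only a sketch and not the counter used by FFC: the function name `count_flops` and the sample statement are hypothetical, and it assumes that `++` increments are not counted and that the code contains no pointer dereferences or floating point literals such as `1e+05` that would be mistaken for operators.

```python
import re

# '+=' is listed first so that it is matched as a single token; '++' is
# matched only so that it can be skipped (an assumption of this sketch).
_OPERATOR = re.compile(r"\+=|\+\+|\+|\*")


def count_flops(code: str) -> int:
    """Count '+', '*' and '+=' occurrences, each as one operation."""
    counted = {"+", "*", "+="}
    return sum(1 for token in _OPERATOR.findall(code) if token in counted)


# Hypothetical statement in the style of generated element tensor code.
statement = "A[0] += W[ip]*(FE0[ip][j]*FE0[ip][k] + G0*FE1[ip][j]);"
print(count_flops(statement))  # one '+=', three '*', one '+'
```

A purely textual count of this kind says nothing about memory traffic or about what the C++ compiler does with the code, so it should be read as a rough cost measure only.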
3.4.2 Performance for common, simple forms

The performance of the two representations is now investigated for two canonical examples: the scalar ‘mass’ matrix and the ‘elasticity-like’ stiffness matrix. The input for the mass matrix form is shown in Figure 3.11 and the input for the elasticity-like stiffness matrix is shown in Figure 3.12. The performance of the two representations is compared for three-dimensional cases on simplices and for various polynomial orders. Code is generated using FFC, and the number of floating point operations required to form the element matrix is reported for all cases. In addition to the number of floating point operations, the time required to compute the element matrix N times is also presented, which is expected in most cases to be strongly correlated with the floating point operation count. As before, values are reported for the tensor contraction representation case together with the ratio of the quadrature value over the tensor contraction value.
UFL code
element = VectorElement("Lagrange", tetrahedron, 3)
u = TrialFunction(element)
v = TestFunction(element)
def eps(v):
    return grad(v) + grad(v).T
a = 0.25*inner(eps(u), eps(v))*dx

Figure 3.12: UFL code for the elasticity-like matrix in three dimensions with element order q = 3.
The time required for insertion into a sparse matrix, which is independent of the element matrix representation, is also reported. The total assembly time is the ‘run-time’ plus the ‘insertion’ time, which provides a picture of the overall assembly performance. The ratio of the total assembly time for the quadrature representation over the total assembly time for the tensor contraction representation, denoted by aq/at, is also presented. When taking this into account, for some forms the difference in performance between different representations appears less drastic.

The various timings for the mass matrix problem are reported in Table 3.3. What is clear from these results is that tremendous speed-ups for computing the element matrices can be achieved using the tensor contraction representation, particularly as the element order is increased. This is perhaps not surprising considering that the geometry tensor for this case is simply a scalar; the entire matrix is therefore essentially precomputed. Also note that the g++ compiler appears to be performing particularly well for the tensor contraction representation in the two cases where q = 2 and q = 3. For the case q = 3, the ratio of flops suggests that the run-time ratio should be around one hundred, while in fact it is close to 6500. However, as the number of flops increases for the tensor contraction representation this effect disappears and the two ratios become almost equal (compare 365 to 378 for the case q = 4).

The effect of the speed-up of computing the element matrix is reduced, however, if the time required to insert terms into a sparse matrix is taken into account. For the case of q = 4, the tensor contraction representation is a factor of 378 faster for computing the element matrix, but when insertion is included an overall speed-up factor of 9.72 is observed. Although this is a substantial speed-up, the efficiency of matrix insertion must be addressed to reap the full benefits of the tensor contraction approach for these types of problems. If, in addition, the time required to perform the remaining parts of the finite element procedure (such as mesh initialisation, application of boundary conditions, and solving the resulting system of equations) is taken into account, the ratio of the total times will become even closer to unity.
                            flops     q/t    run-time [s]      q/t    insertion [s]    aq/at
q = 1 (N = 1 × 10^9)           52     3.8            1.05      4.1             21.4     1.03
q = 2 (N = 1 × 10^8)          136    31.0            0.11    764.3             67.1     2.25
q = 3 (N = 1 × 10^8)          316    91.2            0.12   6493.3            362.3     3.15
q = 4 (N = 1 × 10^7)         1260   364.7            3.40    377.9            143.5     9.72

Table 3.3: Timings for the mass matrix in three dimensions for varying polynomial order basis q.

                            flops     q/t    run-time [s]      q/t    insertion [s]    aq/at
q = 1 (N = 1 × 10^7)         2242     0.6            2.47      1.4            10.17     1.09
q = 2 (N = 1 × 10^6)        18046     2.7            4.79      3.2             9.68     1.74
q = 3 (N = 1 × 10^5)        91522     9.5            2.63     10.5             5.08     4.24
q = 4 (N = 1 × 10^4)       321984    16.3            1.13     13.7             1.86     5.79

Table 3.4: Timings for the elasticity-like matrix in three dimensions for varying polynomial order basis q.
The various timings for the elasticity-like stiffness matrix are presented in Table 3.4. Compared to the mass matrix, the differences in performance of the tensor contraction representation relative to quadrature representation are less dramatic, but nonetheless substantial, especially for higher-order functions.
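The total assembly ratio aq/at follows from the other columns of Tables 3.3 and 3.4 once the total assembly time is taken to be the run-time plus the insertion time. The sketch below is a worked check of that reading; the function name is hypothetical, and because the quadrature run-time is recovered from the rounded q/t column, the result agrees with the tabulated aq/at only up to rounding.

```python
def total_assembly_ratio(t_tensor: float, q_over_t: float, t_insert: float) -> float:
    """Return aq/at = (t_quadrature + t_insert) / (t_tensor + t_insert)."""
    # Quadrature run-time recovered from the tabulated run-time ratio q/t.
    t_quadrature = q_over_t * t_tensor
    return (t_quadrature + t_insert) / (t_tensor + t_insert)


# Mass matrix, q = 4 (Table 3.3): run-time 3.40 s, q/t 377.9, insertion 143.5 s.
mass_q4 = total_assembly_ratio(t_tensor=3.40, q_over_t=377.9, t_insert=143.5)

# Elasticity-like matrix, q = 3 (Table 3.4): 2.63 s, q/t 10.5, insertion 5.08 s.
elasticity_q3 = total_assembly_ratio(t_tensor=2.63, q_over_t=10.5, t_insert=5.08)

print(f"{mass_q4:.2f} {elasticity_q3:.2f}")  # prints "9.72 4.24"
```

The insertion time appears in both numerator and denominator, which is why a run-time ratio of several hundred shrinks to a single-digit ratio for the full assembly.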
3.4.3 Performance for forms of increasing complexity

The complexity of the forms investigated in the previous section is now increased systematically in order to examine under which circumstances the quadrature representation will be more favourable in terms of run-time performance. The comparison is based on the floating point operation count³ and the size of the generated file for a large class of problems. The ‘complexity’ of a variational form is considered to increase when the number of function products increases and when the number of derivatives present increases. Increasing the number of derivatives and/or the number of functions appearing in a form leads to higher rank tensors for the tensor contraction representation. Also, increases in the polynomial order of the basis of a coefficient function lead to an increase in complexity of the geometry tensor $G_T^{\alpha}$, while increases in the polynomial order of the basis of test and trial functions lead to an increase in complexity of the reference tensor $A^0_{i\alpha}$, see (3.8) and (3.7). Initially, attention is restricted to manipulating the number of function multiplications in the forms and the polynomial order of these functions, before introducing products of derivatives.

³ While the tables concerning flops and run-time performance in the previous two sections suggest that the flop count is a reasonably good indicator of performance, it is demonstrated in Section 3.5 that this is not always the case.

To generate forms of greater complexity than those in the previous section, the mass matrix and elasticity-like variational forms with a Lagrange basis of order $q$ are premultiplied with $n_f$ functions of order $p$. In the case of the mass matrix, the modified form reads:

\[
a(u, v) := \int_{\Omega} \left( \prod_{i=1}^{n_f} f_i \right) u v \, \mathrm{d}x, \tag{3.20}
\]

where the test and trial functions $v, u \in V$ with

\[
V := \left\{ v \in H^1(\Omega) : v|_T \in P_q(T) \ \forall \, T \in \mathcal{T}_h \right\}, \tag{3.21}
\]

and $f_i \in V_f$ are coefficient functions with

\[
V_f := \left\{ v \in H^1(\Omega) : v|_T \in P_p(T) \ \forall \, T \in \mathcal{T}_h \right\}. \tag{3.22}
\]

An example of UFL code is shown in Figure 3.13 for the mass matrix premultiplied by coefficient functions where $q = 2$, $n_f = 2$ and $p = 3$.

UFL code
element = FiniteElement("Lagrange", tetrahedron, 2)
element_f = FiniteElement("Lagrange", tetrahedron, 3)
u = TrialFunction(element)
v = TestFunction(element)
f = Coefficient(element_f)
g = Coefficient(element_f)
a = f*g*u*v*dx

Figure 3.13: UFL code for the mass matrix in three dimensions with q = 2, premultiplied by two coefficient functions (n_f = 2) of order p = 3.

A comparison of the representations for the mass matrix with different numbers of premultiplying functions and a range of orders $p$ and $q$ is presented in Table 3.5. In terms of flops, a ratio q/t > 1 indicates that the tensor contraction representation is more efficient, while q/t < 1 indicates that the quadrature representation is more efficient.

                     n_f = 1             n_f = 2              n_f = 3               n_f = 4
                  flops     q/t       flops     q/t        flops     q/t         flops     q/t
p = 1, q = 1        156    1.86         580    1.61         2324    0.49          9492    0.21
p = 1, q = 2        648    7.18        3136    2.44        12512    1.68         52416    0.80
p = 1, q = 3       2700   28.68       12484   12.21        46628    3.29        205716    1.30
p = 1, q = 4       7994   57.62       38058   20.97       155850    5.13        622970    2.04
p = 2, q = 1        360    2.72        3472    0.63        36020    0.39        370020    0.08
p = 2, q = 2       1884    4.10       20236    2.12       203926    0.39       2044176    0.06
p = 2, q = 3       7656   19.95       79936    3.36       766628    0.57       8049636    0.08
p = 2, q = 4      23330   34.23      239550    5.32      2452810    0.78      24548810    0.11
p = 3, q = 1        700    1.93       14020    1.17       288020    0.13       5920020    0.02
p = 3, q = 2       3808    5.75       81136    1.02      1572608    0.09       FFC stopped
p = 3, q = 3      14740   10.53      315652    1.39      6380156    0.11            --
p = 3, q = 4      47850   16.78      980010    1.96     19602234    0.14            --

Table 3.5: The number of operations and the ratio between number of operations for the two representations for the mass matrix in three dimensions as a function of different polynomial orders and numbers of functions.

What is clear from Table 3.5 is that with few premultiplying functions, the tensor contraction approach is generally more efficient, even for relatively high order premultiplying functions. The situation changes quite dramatically as the number of premultiplying functions increases, and as the polynomial order of the premultiplying functions increases. The cases with numerous premultiplying functions are typical of the Jacobian resulting from the linearisation of a nonlinear differential equation in a practical simulation, and are therefore important. It is also noted that the tensor contraction representation becomes more efficient as $q$ increases; however, this effect is less pronounced for the cases where $n_f > 1$ and $p > 1$. Clearly, the selection of the representation can have a tremendous performance impact. For the most complicated cases, where $n_f = 4$, $p = 3$ and $q > 1$, FFC was stopped after more than one hour of generating code for the tensor contraction representation. FFC generated the quadrature representation code for all cases in a couple of seconds.

Interestingly, for complicated forms the operation count is not always a good indicator of performance. For the three-dimensional mass matrix case with $p = 1$, $q = 4$ and $n_f = 4$, it would be expected from the operation count (q/t = 2.04) that the tensor contraction representation would be faster. However, when computing the element tensor 100000 times, a ratio of q/t = 0.81 is observed, meaning that the quadrature representation is faster. Noteworthy for this case is that the size of the generated code for the tensor contraction representation is 13 MB, while the size of the generated quadrature code is only 2.4 MB. This size difference leads not only to a significant difference in the C++ compile-time (almost twenty minutes for the tensor contraction code and only two seconds for the quadrature code), but also appears to result in a drop in run-time performance. The performance drop could be attributed to the increased memory traffic noted by Kirby and Logg (2006). Also, it may be that the compiler is unable to perform effective optimisations on the unrolled code, or that the compiler is particularly effective at optimising the loops in the generated quadrature code.

A similar comparison is made for elasticity-like forms and the results are presented in Table 3.6. The trends in this table are similar to those observed for the mass matrix. Again, FFC was stopped after one hour of generating code for a number of the more complex forms when using the tensor contraction representation. Code generation using the quadrature representation completes in a few seconds for all cases. Compared to the mass matrix case, the number of operations has increased significantly, which has a big impact on both the FFC generation time and the size of the generated code. As an example, FFC spent 63 minutes generating a file of 2.8 GB for the case where $n_f = 2$, $p = 3$ and $q = 4$ for the tensor contraction representation. For the quadrature representation the code was generated in 8.6 seconds and the resulting file size was 9.2 MB.

                      n_f = 1               n_f = 2                n_f = 3
                   flops     q/t         flops     q/t          flops     q/t
p = 1, q = 1        9928    0.13         42832    0.11         183088    0.03
p = 1, q = 2       80020    0.75        331228    0.51        1154620    0.16
p = 1, q = 3      405064    2.31       1466704    1.02        6806512    0.59
p = 1, q = 4     1426374    9.82       5920974    4.60       23425902    1.17
p = 2, q = 1       24940    0.19        268120    0.06        2758888    0.01
p = 2, q = 2      204760    0.82       2071972    0.14       21617452    0.07
p = 2, q = 3      902188    1.66      10789336    0.72        FFC stopped
p = 2, q = 4     3680298    7.43      37846422    1.25             --
p = 3, q = 1       19936    0.29        750880    0.04       21556504    0.01
p = 3, q = 2      367732    0.49       8611804    0.18        FFC stopped
p = 3, q = 3     2068552    1.93      43364368    0.31             --
p = 3, q = 4     7366950    3.71     152974350    0.50             --

Table 3.6: The number of operations and the ratio between number of operations for the two representations for the elasticity-like tensor in three dimensions as a function of different polynomial orders and numbers of functions.
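The family of premultiplied mass matrix forms (3.20) used for the mass matrix comparison differs only in q, p and n_f, so the corresponding UFL input can be produced by a short script. The helper below is a hypothetical sketch and not part of FFC or UFL: it only builds text in the style of Figure 3.13, and the coefficient names f0, f1, ... are chosen for illustration.

```python
def premultiplied_mass_ufl(q: int, p: int, n_f: int, cell: str = "tetrahedron") -> str:
    """Return UFL source text for the mass matrix premultiplied by n_f coefficients."""
    if n_f < 0:
        raise ValueError("n_f must be non-negative")
    coefficients = [f"f{i}" for i in range(n_f)]
    lines = [
        f'element = FiniteElement("Lagrange", {cell}, {q})',
        f'element_f = FiniteElement("Lagrange", {cell}, {p})',
        "u = TrialFunction(element)",
        "v = TestFunction(element)",
    ]
    # One coefficient function of order p per premultiplying factor.
    lines += [f"{name} = Coefficient(element_f)" for name in coefficients]
    # Product of all coefficients with the trial and test functions.
    lines.append("a = " + "*".join(coefficients + ["u", "v"]) + "*dx")
    return "\n".join(lines)


# The case of Figure 3.13: q = 2, p = 3, n_f = 2.
print(premultiplied_mass_ufl(q=2, p=3, n_f=2))
```

Looping such a helper over the values of p, q and n_f gives one input file per entry of a table of this kind; the compilation and flop counting steps are not shown here.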
As seen in Table 3.6, increasing the number of coefficient functions $n_f$ in the form clearly works in favour of the quadrature representation. For $n_f = 3$ the quadrature representation can be expected to perform best for all values of $q$ and $p$, even though q/t = 1.17 for the case where $p = 1$ and $q = 4$. In this specific case the size of the generated code for the tensor contraction representation is 442 MB, which will reduce the run-time performance as discussed previously, assuming that g++ is able to compile the code at all. Increasing the polynomial order of the coefficients, $p$, also works in favour of the quadrature representation, although the effect is less pronounced than the effect of increasing the number of coefficients. The tensor contraction representation appears to perform better when the polynomial order of the test and trial functions, $q$, is increased, although the effect is most pronounced when the number of coefficients is low. However, file size considerations will rule out the tensor contraction representation for a number of forms where, based on the flops ratio, it would be expected to outperform the quadrature representation. It is more difficult in these cases to make broad generalisations as to the best representation. This again suggests that a method for automatically determining the best representation based on inspection of the form may be interesting. A discussion of such a strategy is, however, postponed until Section 3.6.

Finally, the influence of premultiplying a vector-valued Poisson variational form by the divergence of vector-valued functions is investigated. The UFL code for the case $n_f = 2$, $p = 3$ and $q = 2$ is shown in Figure 3.14. A comparison of tensor contraction and quadrature representations is performed, as in the previous cases, and the results are shown in Table 3.7. Premultiplying forms with derivatives of functions clearly increases the complexity to such a degree that the tensor contraction representation involves fewer operations for only a very limited number of the considered cases.

UFL code
element = VectorElement("Lagrange", triangle, 2)
element_f = VectorElement("Lagrange", triangle, 3)
u = TrialFunction(element)
v = TestFunction(element)
f = Coefficient(element_f)
g = Coefficient(element_f)
a = div(f)*div(g)*inner(grad(u), grad(v))*dx

Figure 3.14: UFL code for the vector-valued Poisson problem in two dimensions with q = 2, premultiplied by the divergence of two vector-valued functions (n_f = 2) of order p = 3.
3.5 Performance comparisons of quadrature optimisations
In this section the impact of the optimisation strategies, outlined in Section 3.3, on the run-time performance is investigated. The point is not to present a rigorous analysis of the optimisations, but to provide indications as to when the different strategies will be most effective. The performance of the quadrature optimisations will be investigated using two forms, namely the bilinear form for the weighted Laplace equation (3.1), see the UFL input in Figure 3.1, and the bilinear form for the Mooney–Rivlin hyperelasticity model from (2.33), page 36, in three dimensions. The UFL input for the hyperelasticity model is shown in Figure 3.15. In both cases quadratic Lagrange finite elements will be used.

                    n_f = 1              n_f = 2
                  flops     q/t        flops     q/t
p = 1, q = 1        708    0.29         6148    0.07
p = 1, q = 2       2202    0.90        18394    0.13
p = 1, q = 3       8090    1.48        66394    0.19
p = 1, q = 4      22548    2.53       183892    0.32
p = 2, q = 1       1412    0.16        24580    0.04
p = 2, q = 2       7790    0.52       162766    0.03
p = 2, q = 3      24902    0.57       516606    0.05
p = 2, q = 4      60156    1.27      1246436    0.10
p = 3, q = 1       2116    0.30        96772    0.02
p = 3, q = 2      11862    0.36       545422    0.02
p = 3, q = 3      45086    0.54      1695358    0.03
p = 3, q = 4     110668    1.08      4093924    0.04

Table 3.7: The number of operations and the ratio between number of operations for the two representations for the vector-valued Poisson problem in two dimensions as a function of different polynomial orders and numbers of functions.

All tests were performed using the same hardware and software setup as described in the previous section, with the small difference that the g++ compiler options are varied. The two forms are compiled with the different FFC optimisations, and the number of floating point operations (flops) to compute the local element tensor is determined. The number of flops is defined as in the previous section, that is, as the sum of all appearances of the operators ‘+’ and ‘*’ in the code. The ratio between the number of flops of the current FFC optimisation and the standard quadrature representation, o/q, is also computed. The generated code is then compiled with g++ using four different optimisation options for g++, and the time needed to compute the element tensor N times is measured. In the following, -zeros will be used as shorthand for the -f eliminate_zeros option, -simplify is shorthand for the -f simplify_expressions option, -ip is shorthand for the -f precompute_ip_const option and -basis is shorthand for the -f precompute_basis_const option.

UFL code
element = VectorElement("Lagrange", tetrahedron, 2)
w = TestFunction(element)
du = TrialFunction(element)
u = Coefficient(element)
c1 = Constant(tetrahedron)
c2 = Constant(tetrahedron)
I = Identity(3)                      # Identity tensor
F = I + grad(u)                      # Deformation gradient
C = F.T*F                            # Right Cauchy--Green tensor
I_C = tr(C)                          # First invariant of C
II_C = (I_C**2 - tr(C*C))/2.0        # Second invariant of C
# Stored strain energy density (Mooney--Rivlin model)
Psi = c1*(I_C - 3.0) + c2*(II_C - 3.0)
Pi = Psi*dx                          # Potential energy
F = derivative(Pi, u, w)             # First variation of Pi about u in direction w
J = derivative(F, u, du)             # Jacobian

Figure 3.15: UFL input for the Mooney–Rivlin hyperelasticity model in three dimensions using quadratic elements. It is the bilinear form, the Jacobian J, which is of interest in the performance comparison.

The operation counts for the weighted Laplace equation with different FFC optimisations can be seen in Table 3.8, while Figure 3.16 shows the run-time performance for different compiler options for $N = 5 \times 10^7$. The FFC compiler options can be seen on the x-axis in the figure and the four g++ compiler options are shown with different colours. The FFC and g++ compile-times were less than one second for all optimisation options.

FFC optimisation      flops    o/q
None                   4176   1.00
-zeros                 6672   1.60
-simplify              2712   0.65
-simplify -zeros       1920   0.46
-ip                    3756   0.90
-ip -zeros             4290   1.03
-basis                 3756   0.90
-basis -zeros          3690   0.88

Table 3.8: Operation counts for the weighted Laplace equation.

It is clear from Figure 3.16 that run-time performance is greatly influenced by the g++ optimisations. Compared to the case where no g++ optimisations are used (the -O0 flag), the run-time for the standard quadrature code improves by a factor of 4.70 when using the -O2 option, 6.86 when using the -O2 -funroll-loops option and 10.65 when using the -O3 option. The -O3 option does not appear to improve the run-time noticeably beyond the improvement observed for the -O2 -funroll-loops option when the FFC optimisation option -zeros is used. Using the FFC optimisation option -zeros alone for this form does not improve run-time performance. In fact, using this option in combination with any of the other optimisation options increases the run-time, even when combining with the option -simplify, which has a significantly lower operation count compared to the standard quadrature representation. A curious point to note is that without g++ optimisation there is a significant difference in run-time for the -ip and -basis options, even though they involve the same number of flops. When g++ optimisations are switched on, this difference is eliminated completely and the run-times for the two FFC optimisations are identical. This suggests that it is not possible to predict run-time performance from the operation count alone, since the type of FFC optimisation must be taken into account as well as the intended use of g++ compiler options. The optimal combination of optimisations for this form is FFC option -ip or -basis combined with g++ option -O2 -funroll-loops, in which case the run-time has improved by a factor of 12.3 compared to standard quadrature code with no g++ optimisations.
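The shorthand introduced in this section maps each label used in the tables and figures to one `-f` option of FFC. A small lookup table makes the mapping explicit; this is only a sketch, in which the dictionary and function names are hypothetical, and how the expanded flags are combined with the rest of the FFC command line is deliberately left out since it is not shown in the text.

```python
from typing import List

# Shorthand label -> FFC option, exactly as defined in the text.
FFC_SHORTHAND = {
    "-zeros": ["-f", "eliminate_zeros"],
    "-simplify": ["-f", "simplify_expressions"],
    "-ip": ["-f", "precompute_ip_const"],
    "-basis": ["-f", "precompute_basis_const"],
}


def expand_options(shorthand: str) -> List[str]:
    """Expand a label such as '-simplify -zeros' into a list of FFC flags."""
    if shorthand.strip().lower() == "none":
        return []  # standard quadrature code, no FFC optimisation
    flags: List[str] = []
    for token in shorthand.split():
        flags += FFC_SHORTHAND[token]  # KeyError for an unknown label
    return flags


print(expand_options("-simplify -zeros"))
# ['-f', 'simplify_expressions', '-f', 'eliminate_zeros']
```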
The operation counts and FFC code generation time for the bilinear form for hyperelasticity with different FFC optimisations are presented in Table 3.9, while Figure 3.17 shows the run-time performance for different compiler options for $N = 5 \times 10^4$.

[Plot: time [s] on the vertical axis (ticks at 10^1, 10^2 and 10^3) for the FFC options none, -ip, -zeros, -basis, -simplify, -ip -zeros, -basis -zeros and -simplify -zeros; legend: -O0, -O2, -O2 -funroll-loops, -O3.]

Figure 3.16: Run-time performance for the weighted Laplace equation for different compiler options. The x-axis shows the FFC compiler options, and the colours denote the g++ compiler options.

Comparing the number of flops involved in computing the element tensor to the weighted Laplace example, it is clear that this problem is considerably more complex. The FFC code generation times in Table 3.9 show that the -simplify optimisation, as anticipated, is the most expensive to perform. The g++ compile-times for all test cases were less than three seconds for all optimisation options. A point to note is that the scope for reducing the flop count is considerably greater for this problem than for the weighted Laplace problem, with a difference in the number of flops spanning several orders of magnitude between the different FFC optimisations. This compares to a difference in flops of roughly a factor of two between the non-optimised and the most effective optimisation strategy for the weighted Laplace problem.

FFC optimisation      FFC time [s]    o/q       flops      o/q
None                           1.8   1.00    56228760    1.000
-zeros                         1.8   1.00    38844456    0.691
-simplify                      6.9   3.83     3086595    0.055
-simplify -zeros               5.8   3.22      185697    0.003
-ip                            2.0   1.11    44310392    0.788
-ip -zeros                     2.9   1.61    12562106    0.223
-basis                         2.0   1.11     3664392    0.065
-basis -zeros                  3.0   1.67     1609430    0.029

Table 3.9: FFC code generation times and operation counts for the hyperelasticity example.

In the case where no g++ optimisation is used, the run-time performance for the hyperelastic problem can be directly related to the number of floating point operations. When the g++ optimisation -O2 is switched on, this effect becomes less pronounced. Another point to note, in connection with the g++ optimisations, is that switching on additional optimisations beyond -O2 does not seem to provide any further improvements in run-time.

For the hyperelasticity example, the option -zeros has a positive effect on the performance, in particular when combined with the -basis and -simplify optimisations. This is in contrast with the weighted Laplace equation. The reason is that the test and trial functions are vector valued rather than scalar valued, which allows more zeros to be eliminated. Finally, it is noted that the -simplify option performs particularly well for this example compared to the weighted Laplace problem. The reason is that the nature of the hyperelasticity form results in a relatively complex expression to compute the entries in the local element tensor. However, this expression only consists of a few different variables (components of the inverse of the Jacobian and basis function values), which makes the -simplify option very efficient since many terms are common and can be precomputed and hoisted.

[Plot: time [s] on the vertical axis (ticks at 10^0, 10^1, 10^2, 10^3 and 10^4) for the FFC options none, -ip, -zeros, -basis, -simplify, -ip -zeros, -basis -zeros and -simplify -zeros; legend: -O0, -O2, -O2 -funroll-loops, -O3.]

Figure 3.17: Run-time performance for the hyperelasticity example for different compiler options. The x-axis shows the FFC compiler options, and the colours denote the g++ compiler options.

For the hyperelasticity form, the optimal combination of optimisations is FFC option -simplify -zeros and g++ option -O2 -funroll-loops. This combination improves the run-time performance by approximately one order of magnitude compared to all other FFC options when g++ optimisations are included. Compared to the case where no optimisation is used by either FFC or g++, the run-time performance of the code is improved by a factor of 744.

For the considered examples, it is clear that no single optimisation strategy is the best for all cases. Furthermore, which generation phase optimisations are best depends on which optimisations are performed by the g++ compiler. It is also very likely that different C++ compilers will give different results for the test cases presented in this section. The general recommendation for selecting the appropriate optimisation for production code is therefore that the choice should be based on a benchmark program for the specific problem.
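The o/q column for the flop counts in Table 3.9 is simply each count divided by the count for the non-optimised quadrature code. The sketch below recomputes it from the tabulated flops and also measures the spread between the largest and smallest count; the variable names are illustrative only.

```python
import math

# Flop counts for the hyperelasticity example, copied from Table 3.9.
flops = {
    "None": 56228760,
    "-zeros": 38844456,
    "-simplify": 3086595,
    "-simplify -zeros": 185697,
    "-ip": 44310392,
    "-ip -zeros": 12562106,
    "-basis": 3664392,
    "-basis -zeros": 1609430,
}

baseline = flops["None"]
ratios = {option: count / baseline for option, count in flops.items()}

for option, ratio in ratios.items():
    print(f"{option:18s} {ratio:.3f}")

# Spread between the most and least expensive variant, in powers of ten.
orders_of_magnitude = math.log10(max(flops.values()) / min(flops.values()))
print(f"spread: {orders_of_magnitude:.1f} orders of magnitude")
```

The spread evaluates to roughly 2.5 orders of magnitude, compared with roughly a factor of two for the weighted Laplace counts in Table 3.8.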
3.6 Automatic selection of representation
In this chapter it has been illustrated how the run-time performance of the generated code for variational forms can be improved by using various optimisation options for the FFC and g++ compilers, and by changing the representation of the form. Numerical experiments have shown that the relative run-time performance of the two representations can differ substantially depending on the nature of the considered variational form. In general, the tensor contraction approach deals well with forms which involve high-order bases and few coefficient functions, whereas the quadrature representation is more efficient as the number of coefficient functions (other than constant coefficients) and derivatives in a form increases. Hence, in general the quadrature representation is significantly faster for more complicated forms.

In an automated modelling framework, like FEniCS, it seems natural to attempt to select the most favourable representation automatically. When comparing the two representations in Section 3.4 it was found that the operation count is a reasonably good indicator for which form will exhibit the best run-time performance. FFC presently computes the operation count for the code which is generated, on the basis of which a choice could be made, but this involves generating computer code for each case, which can be time consuming. Ideally, the form compiler would select the best representation based on an a priori inspection of the form. It turns out, however, that this is a non-trivial task if the goal is a general approach which holds for any form which FFC can handle. Furthermore, as it has been shown in the previous section, the code with the lowest number of flops, at least for the quadrature representation, does not always perform best for a given form. Finally, the run-time performance even depends on which g++ options are used.
A strategy for selecting between representations based only on an estimation of flops does, therefore, not seem feasible. Choosing the combination of form representation and optimisation options that leads to optimal performance will inevitably require a benchmark study of the specific problem. However, very often many variational forms of varying complexity are needed to solve more complex problems. Setting up benchmarks for all of them is cumbersome and time consuming. Additionally, during the model development stage run-time performance is of minor importance compared to rapid prototyping of variational forms as long as the generated code performs reasonably well. The default behaviour of FFC is, therefore, to automatically determine which form representation should be used based on a measure for the cost of using the tensor representation. In short, the cost is simply computed as the maximum value of the sum of the number of coefficients and derivatives present in the monomials representing the form. If this cost is larger than a specified threshold, currently set to three, the quadrature representation is selected. Recall from Table 3.6 that when $n_f = 3$ the flop count for the quadrature representation was significantly lower for virtually all the test cases. Although this approach may seem ad hoc, it will work well for those situations where the difference in run-time performance is significant. It is important to remember that the generated code is only concerned with the evaluation of the local element tensor and that the time needed to insert the values into a sparse matrix and to solve the system of equations will reduce any difference, particularly for simple forms. Therefore, making a correct choice of representation is less important for forms where the difference in run-time performance is small. A future improvement could be to devise a strategy for also letting the system select the optimisation strategy for the quadrature representation automatically.
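The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a form is modelled as a list of monomials, each reduced to a hypothetical (number of coefficients, number of derivatives) pair, and the function names are invented for this sketch rather than taken from FFC.

```python
TENSOR_COST_THRESHOLD = 3  # threshold value quoted in the text


def tensor_cost(monomials):
    """Maximum over all monomials of (#coefficients + #derivatives).

    Each monomial is a hypothetical (n_coefficients, n_derivatives) pair.
    """
    return max(n_coeff + n_deriv for n_coeff, n_deriv in monomials)


def select_representation(monomials, threshold=TENSOR_COST_THRESHOLD):
    """Pick quadrature when the tensor cost exceeds the threshold."""
    if tensor_cost(monomials) > threshold:
        return "quadrature"
    return "tensor"


if __name__ == "__main__":
    # Assumed monomial data: a Laplace-type term has no coefficients and
    # two derivatives; weighting it by two coefficient functions adds two.
    laplace = [(0, 2)]
    weighted_laplace = [(2, 2)]
    print(select_representation(laplace))           # tensor
    print(select_representation(weighted_laplace))  # quadrature
```

Because the cost is the maximum over monomials, a single expensive term is enough to switch the whole form to the quadrature representation in this sketch.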
Regardless of whether it is possible to define an optimal strategy for automatically selecting the representation (and possibly the optimisation), the applicability of automated modelling is definitely extended by having both tensor contraction and quadrature representations, and their optimisations, as part of the computational arsenal.
3.7 Future optimisations
The optimisations proposed in Section 3.3.5 for the quadrature representation are primarily concerned with the run-time performance of the generated code and the strategies follow along similar lines as the ones already implemented and discussed in Section 3.3. However, as the number of FEniCS users has increased, so has the complexity of the problems that users are trying to solve. In Section 3.4 it was demonstrated that, for some of the more complicated forms, the tensor contraction representation can take hours to generate code for the given problem and that the size of the generated code can become very large. For very complex forms, typically nonlinear forms that are linearised automatically by UFL, similar trends can be observed also for the quadrature representation. It is, therefore, necessary to develop new strategies for the code generation process to reduce the generation time and the size of the generated code.

Two possible approaches that could be investigated are outlined below. Currently, the code to compute derivatives of, for instance, basis functions like the term $\sum_{\beta=1}^{d} \sum_{\alpha_1=1}^{d} \frac{\partial X_{\alpha_1}}{\partial x_{\beta}} \frac{\partial \Phi_{i_1}(X^q)}{\partial X_{\alpha_1}}$ in (3.4) is located inside the loop over basis function indices j and k, see for instance Figure 3.2. From the UFL input
UFL code
    element = FiniteElement("Lagrange", triangle, 1)
    u = TrialFunction(element)
    v = TestFunction(element)
    a = inner(grad(u), grad(v))*dx

the generated code for the loop over basis function indices will be
C++ code
    for (unsigned int j = 0; j < 3; j++)
    {
      for (unsigned int k = 0; k < 3; k++)
      {
        A[j*3 + k] += (((K_00*FE0_D10[0][j] + K_10*FE0_D01[0][j]))*
                       ((K_00*FE0_D10[0][k] + K_10*FE0_D01[0][k])) +
                       ((K_01*FE0_D10[0][j] + K_11*FE0_D01[0][j]))*
                       ((K_01*FE0_D10[0][k] + K_11*FE0_D01[0][k])))*W1*det;
      }
    }

which is almost identical to that in Figure 3.2. However, the only difference between the code to compute the derivative of u and v is the loop index because u and v are defined using the same finite element. Thus precomputing the derivatives outside the loop will lead to a reduction in the code size (and in the number of operations needed). The improved code for the given case would then become:
C++ code
    double FE0_d0[3];
    double FE0_d1[3];
    for (unsigned int r = 0; r < 3; r++)
    {
      FE0_d0[r] = (K_00*FE0_D10[0][r] + K_10*FE0_D01[0][r]);
      FE0_d1[r] = (K_01*FE0_D10[0][r] + K_11*FE0_D01[0][r]);
    }
    for (unsigned int j = 0; j < 3; j++)
    {
      for (unsigned int k = 0; k < 3; k++)
      {
        A[j*3 + k] += ((FE0_d0[j]*FE0_d0[k]) +
                       (FE0_d1[j]*FE0_d1[k]))*W1*det;
      }
    }
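That hoisting the mapped derivatives out of the loop leaves the element tensor unchanged can be checked numerically. The following Python sketch mirrors the two loop structures for a single quadrature point. The sample values are assumptions chosen for illustration: reference derivatives of linear Lagrange basis functions on a triangle, and an identity inverse Jacobian with weight 0.5 and determinant 1.

```python
def element_tensor_naive(K, D10, D01, W1, det):
    """Recompute the mapped derivatives inside the (j, k) loop."""
    n = len(D10)
    A = [0.0] * (n * n)
    for j in range(n):
        for k in range(n):
            A[j * n + k] += (
                (K[0][0] * D10[j] + K[1][0] * D01[j])
                * (K[0][0] * D10[k] + K[1][0] * D01[k])
                + (K[0][1] * D10[j] + K[1][1] * D01[j])
                * (K[0][1] * D10[k] + K[1][1] * D01[k])
            ) * W1 * det
    return A


def element_tensor_hoisted(K, D10, D01, W1, det):
    """Precompute the mapped derivatives once, outside the (j, k) loop."""
    n = len(D10)
    d0 = [K[0][0] * D10[r] + K[1][0] * D01[r] for r in range(n)]
    d1 = [K[0][1] * D10[r] + K[1][1] * D01[r] for r in range(n)]
    A = [0.0] * (n * n)
    for j in range(n):
        for k in range(n):
            A[j * n + k] += (d0[j] * d0[k] + d1[j] * d1[k]) * W1 * det
    return A


if __name__ == "__main__":
    # Assumed sample data (not taken from generated code).
    D10 = [-1.0, 1.0, 0.0]
    D01 = [-1.0, 0.0, 1.0]
    K = [[1.0, 0.0], [0.0, 1.0]]
    print(element_tensor_naive(K, D10, D01, 0.5, 1.0))
    print(element_tensor_hoisted(K, D10, D01, 0.5, 1.0))
```

Both functions return the same nine entries; the hoisted variant evaluates each mapped derivative three times in total instead of once per (j, k) pair.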
The drawback of this approach is that the optimisations discussed in Section 3.3, particularly the -f simplify optimisation, could be less effective as fewer common expressions involving the geometry constants like K_00 will be present.

To reduce the size of the code even further (and possibly also improve run-time performance), a linear algebra library, for instance Armadillo (http://arma.sourceforge.net/), could be employed to perform block operations using optimised BLAS. The generated code will then become:
C++ code
    arma::vec FE0_d0(3);
    arma::vec FE0_d1(3);
    for (unsigned int r = 0; r < 3; r++)
    {
      FE0_d0[r] = (K_00*FE0_D10[0][r] + K_10*FE0_D01[0][r]);
      FE0_d1[r] = (K_01*FE0_D10[0][r] + K_11*FE0_D01[0][r]);
    }

    arma::mat R = (FE0_d0*arma::trans(FE0_d0) +
                   FE0_d1*arma::trans(FE0_d1))*W1*det;

    // Copy values to A (R is symmetric, so Armadillo's column-major
    // storage order does not affect the result here)
    double* p = R.memptr();
    for (int r = 0; r < 9; r++)
      A[r] = p[r];
In the given case, the size of the code has not been reduced significantly. The approach will be particularly effective in situations involving, for instance, the inverse operator in UFL. The inverse operator in UFL (only defined for $1 \times 1$, $2 \times 2$ and $3 \times 3$ matrices) is hardcoded as a function of the matrix components. This leads to a very complex expression inside the loop over basis functions when following the conventional quadrature approach, which can be substituted by a simple function call to arma::inv.[4] The strategy outlined above could have a negative influence on the run-time performance due to overhead in the linear algebra library or by making it more difficult for the g++ compiler to perform optimisations.

[4] This approach might not be feasible for linearisations of the inverse when using the automatic differentiation functionality in UFL.
As demonstrated in this chapter, having multiple representations and optimisations available when considering variational forms of different complexity is an advantage in an automated framework as it is the combination of form complexity, FFC optimisations and g++ compiler options that determines the run-time performance of the generated code. Implementing the strategies outlined above will, therefore, extend the applicability of FEniCS to a range of even more complex problems than what can be handled at present.

4 Automation of discontinuous Galerkin methods
Discontinuous Galerkin methods in space have emerged as a generalisation of finite element methods for solving a range of partial differential equations. While historically used for first-order hyperbolic equations, discontinuous Galerkin methods are now applied to a range of hyperbolic, parabolic and elliptic problems. In addition to the usual integration over cell volumes that characterises the conventional finite element method, discontinuous Galerkin methods also involve the integration of flux terms over interior facets. Discontinuous Galerkin methods exist in many variants, and are generally distinguished by the form of the flux on facets. A sample of fluxes for elliptic problems can be found in Arnold et al. (2002).

Integration of functions on interior facets and evaluating flux terms, expressed as jumps and averages of quantities of interest, adds complexity to the standard finite element procedure. Therefore, it is obviously desirable to also handle these types of formulations in an automated fashion as this permits the rapid prototyping and testing of new methods. In this chapter the necessary extensions to the FEniCS framework for implementing discontinuous Galerkin formulations are presented. Specifically, new abstractions in UFL, FFC, UFC and DOLFIN are needed in order to handle the automation of the characteristic features of discontinuous Galerkin methods. The extended framework is then demonstrated through a range of common problems, including the Poisson, advection–diffusion, Stokes and biharmonic equations. The presentation of the extensions in this chapter is based on the work in Ølgaard et al. (2008a).[1]

Although the functionality is implemented with discontinuous Galerkin methods in mind, it also allows a range of novel finite element methods that draw upon discontinuous Galerkin methods to be handled automatically by the FEniCS framework. These methods may not involve discontinuous function spaces but do involve integration over interior facets. Such examples can be found in Hughes et al. (2006); Wells and Dung (2007); Labeur and Wells (2007).

[1] In the original paper, the discontinuous Galerkin operators were implemented in the form language of FFC which was later merged into UFL. The code examples from the paper have also been updated to be compliant with FEniCS version 1.0.

Figure 4.1: Two cells $T^+$ and $T^-$ sharing a common facet $S$.
4.1 Extending the framework to discontinuous Galerkin methods
Discontinuous Galerkin methods involve variational forms that include integrals over the interior facets of a finite element mesh. Consider for example the following bilinear form which may appear as a term in a discontinuous Galerkin formulation:

$a(u, v) := \sum_{S \in \Gamma_0} \int_S \llbracket u \rrbracket \llbracket v \rrbracket \, \mathrm{d}s$, (4.1)

where $\Gamma_0$ denotes the set of all interior facets of the triangulation $\mathcal{T}_h$ and $\llbracket v \rrbracket$ denotes the jump in the function value of $v$ across the facet $S$:

$\llbracket v \rrbracket = v^+ - v^-$. (4.2)

Here, $v^+$ and $v^-$ denote the values of $v$ on the facet $S$ as seen from the two cells $T^+$ and $T^-$ incident with $S$, respectively (see Figure 4.1). Note that each interior facet is incident to exactly two cells which may be labelled $T^+$ and $T^-$. The union of these two cells, $T = T^+ \cup T^-$, will be referred to as the macro cell.

In order to handle variational forms such as (4.1) in the FEniCS framework, additional functionality is needed in a number of components. Obviously, UFL must be extended to support the definition of integrals over interior facets. These integrals may involve functions which can be evaluated on either of the two cells incident to the interior facet. DOLFIN must be extended to support assembly of multilinear forms containing interior facet integrals which in turn requires the UFC interface to be extended with a new integral class. As UFC is only concerned with the interface of this class, FFC must support code generation for interior facet integrals defined using the UFL syntax. The following sections describe the extensions that have been developed in each of these four components.
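As a small illustration of a form like (4.1), consider piecewise constant functions on a one-dimensional mesh, where every interior facet is a single point and the facet integral reduces to a point evaluation. The Python sketch below assembles the resulting matrix by iterating over interior facets and adding a local tensor defined on the macro cell. The mesh numbering, the function name and the dense list-of-lists storage are assumptions made for this sketch; none of it is DOLFIN code.

```python
def assemble_jump_form(num_cells):
    """Assemble a(u, v) = sum over interior facets of jump(u)*jump(v).

    Assumed setting: piecewise constants on a 1D mesh with cells
    0..num_cells-1, where facet i separates cell i-1 (T+) and cell i (T-).
    """
    A = [[0.0] * num_cells for _ in range(num_cells)]
    # Local tensor on the macro cell (union of T+ and T-): the basis
    # function of T+ jumps by +1 and the basis function of T- by -1.
    local = [[1.0, -1.0], [-1.0, 1.0]]
    for facet in range(1, num_cells):  # interior facets only
        macro_dofs = (facet - 1, facet)  # degrees of freedom on (T+, T-)
        for i, gi in enumerate(macro_dofs):
            for j, gj in enumerate(macro_dofs):
                A[gi][gj] += local[i][j]
    return A


if __name__ == "__main__":
    for row in assemble_jump_form(3):
        print(row)
```

Each interior facet contributes to the rows and columns of both incident cells, which is why the local tensor is indexed by the degrees of freedom of the macro cell rather than those of a single cell.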
Mathematical notation             UFL notation
$f^+$, $f^-$                      f('+'), f('-')
$\langle f \rangle$               avg(f)
$\llbracket f \rrbracket$         jump(f), jump(f, n)

Table 4.1: Table of discontinuous Galerkin operators in UFL.
4.1.1 Extending the Unified Form Language

As illustrated in (4.1) and (4.2), a central concept of discontinuous Galerkin methods is the possibility that an expression $f$ has two values, denoted $f^+$ and $f^-$, on an interior facet $S$ when it is evaluated based on the two cells $T^+$ and $T^-$ which are incident with $S$. The UFL notation f('+') and f('-') is used to restrict an expression f to $T^+$ and $T^-$ respectively. It is possible to implement a number of common operators for discontinuous Galerkin methods using these simple definitions. For convenience, UFL provides the set of operators presented in Table 4.1 to facilitate compact implementation of these methods. Two typical operators are the average and jump operators, frequently denoted by $\langle f \rangle$ and $\llbracket f \rrbracket$, respectively. The definition of the average operator is $\langle f \rangle = (f^+ + f^-)/2$ while the definition of the jump operator is, in general, $\llbracket f \rrbracket = f^+ - f^-$ as shown in (4.2). For convenience, these two operators are available in UFL by avg(f) and jump(f). It is common to use the outward unit normal, denoted by $n$, to the interior facet when defining the jump operator such that for a scalar valued expression $\llbracket f \rrbracket = f^+ n^+ + f^- n^-$, while for a vector or tensor valued expression $\llbracket f \rrbracket = f^+ \cdot n^+ + f^- \cdot n^-$. In both definitions, $n^+$ and $n^-$ denote the outward unit normal to the interior facet, $S$, as seen from the two cells $T^+$ and $T^-$ respectively. These two definitions are implemented in a single operator jump(f, n) by letting UFL automatically determine the rank of the expression, f, and return the appropriate definition. It should be pointed out that, because UFL is an embedded language and because of the restriction operators f('+') and f('-'), a user can easily implement custom operators for discontinuous Galerkin methods. What remains, in order to express variational forms of the type shown in (4.1), is to define a notation for the interior facet integral.
Following the notation for the domain and exterior boundary integrals introduced in Section 1.3.2, the integral over interior facets $\int_{\Gamma_{0,k}} I \, \mathrm{d}S$ is simply written as I*dS(k) where k is the subdomain number and I is a valid UFL expression. The extensions described above facilitate compact implementation of a range of discontinuous Galerkin methods using a syntax which is close to the mathematical notation. As a simple illustration, the bilinear form in (4.1) is represented in UFL by:

UFL code
    a = jump(u)*jump(v)*dS
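The operator definitions given above can be mimicked numerically to see how the rank of an expression changes the meaning of the jump. In the Python sketch below, plain numbers and lists stand in for the restricted values $f^+$ and $f^-$ at one point of a facet. The functions only share their names with the UFL operators and are not UFL's implementation; the sketch also assumes $n^- = -n^+$ and covers scalars and vectors only.

```python
def avg(f_plus, f_minus):
    """Average of a scalar quantity: (f+ + f-)/2."""
    return 0.5 * (f_plus + f_minus)


def jump(f_plus, f_minus, n_plus=None):
    """Jump of a quantity across a facet.

    Without a normal: f+ - f-.
    With a normal and scalar f: f+ n+ + f- n- (a vector).
    With a normal and vector f: f+ . n+ + f- . n- (a scalar).
    """
    if n_plus is None:
        return f_plus - f_minus
    n_minus = [-c for c in n_plus]  # assumption: n- = -n+
    if isinstance(f_plus, (int, float)):
        return [f_plus * a + f_minus * b for a, b in zip(n_plus, n_minus)]
    dot_plus = sum(f * a for f, a in zip(f_plus, n_plus))
    dot_minus = sum(f * b for f, b in zip(f_minus, n_minus))
    return dot_plus + dot_minus


if __name__ == "__main__":
    print(avg(3.0, 1.0))
    print(jump(3.0, 1.0))
    print(jump(3.0, 1.0, [1.0, 0.0]))
    print(jump([1.0, 2.0], [0.5, 0.5], [0.0, 1.0]))
```

With opposite normals, the scalar jump with a normal equals $(f^+ - f^-) n^+$, so it raises the rank by one, whereas the vector jump lowers it by one.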
4.1.2 Extending the Unified Form-assembly Code
As DOLFIN relies on the UFC interface when evaluating local finite element tensors, the UFC interface must define the tabulate_tensor function also for interior facet integrals. This function is provided by the class ufc::interior_facet_integral and the interface is
C++ code
    /// Tabulate the tensor for the contribution from a local interior facet
    virtual void tabulate_tensor(double* A,
                                 const double * const * w,
                                 const cell& c0,
                                 const cell& c1,
                                 unsigned int facet0,
                                 unsigned int facet1) const = 0;

where A is a pointer to an array which will hold the values of the local element tensor and w contains nodal values of any coefficient functions present in the integral. The two cells c0 and c1 correspond to the cells $T^+$ and $T^-$ incident with the given facet $S$ while facet0 and facet1 are the local indices of the facet $S$ relative to the cells c0 and c1 respectively. This is illustrated in Figure 1.6b, page 19, where the local facet (edge) index of the shared facet is e0 relative to one cell while it is e2 relative to the other cell. The implication of this aspect is elaborated in the following section.
4.1.3 Extending the FEniCS Form Compiler
FFC must also be extended in order to generate code for the new integral class in UFC to evaluate the local facet tensor. In Section 3.2.2, it was shown how the cell tensor (element tensor) can be computed from the tensor representation
$A_{T,i} = \sum_{\alpha} A^0_{i\alpha} G_T^{\alpha}$. (4.3)
Similarly, one may use the affine mappings (defined in Section 3.2.1) $F_{T^+}$ and $F_{T^-}$ to obtain a tensor representation for the interior facet tensor $A_S$. However, depending on the topology of the macro cell $T$, one obtains different tensor representations. For a triangular mesh, each cell has three facets (edges) and there are thus $3 \times 3 = 9$ different topologies to consider; there are nine different ways in which two edges can meet. Similarly, for a tetrahedral mesh, there are $4 \times 4 = 16$ different topologies to consider. Notice that this is only true because FFC assumes the UFC numbering convention of mesh entities, outlined in Section 1.3.4 and illustrated in Figure 1.6b, which guarantees that two incident simplicial cells always agree on the orientation of an incident facet. If no particular ordering of the mesh entities is assumed, one needs to consider $3 \times 3 \times 2 = 18$ different topologies for triangles and $4 \times 4 \times 6 = 96$ topologies for tetrahedra. This is because there are two different ways to superimpose two edges, and there are six different ways to superimpose two faces. The tensor representation for the interior facet tensor can then be written in the form

$A_{S,i} = \sum_{\alpha} A^{0, f^+(S), f^-(S)}_{i\alpha} G^{\alpha}_{T(S)}$, (4.4)

where $f^+$ and $f^-$ denote the local numbers of the two facets that meet at $S$ relative to the two cells $T^+$ and $T^-$ respectively. Note that the geometry tensor $G_T^{\alpha}$ in (4.3) involves the mapping from the reference cell and differs from the geometry tensor $G^{\alpha}_{T(S)}$ in (4.4), which may involve the mapping from the reference cell and the mapping from the reference facet. The reference tensor $A^{0, f^+, f^-}$ is precomputed for each facet–facet combination $(f^+, f^-)$ and a run-time decision must be made as to which reference tensor should be contracted with the geometry tensor.
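The topology counts and the run-time selection of a reference tensor can be illustrated with a short Python sketch. The generalisation of the counting rule to arbitrary simplex dimension is an assumption that merely reproduces the figures quoted for triangles and tetrahedra, and the dictionary keyed by facet pairs is a hypothetical data layout, not the code FFC generates.

```python
from math import factorial


def num_facet_topologies(dim, ordered=True):
    """Number of facet-facet combinations for simplices of dimension dim."""
    facets_per_cell = dim + 1
    count = facets_per_cell ** 2
    if not ordered:
        # A facet of a dim-simplex has dim vertices; without an assumed
        # entity ordering two such facets can be superimposed in dim! ways.
        count *= factorial(dim)
    return count


def facet_tensor(reference_tensors, facet0, facet1, G):
    """Select the reference tensor for (facet0, facet1) at run time and
    contract it with the geometry tensor G (both stored as plain lists)."""
    A0 = reference_tensors[(facet0, facet1)]
    return [sum(a * g for a, g in zip(row, G)) for row in A0]


if __name__ == "__main__":
    print(num_facet_topologies(2), num_facet_topologies(3))
    print(num_facet_topologies(2, ordered=False),
          num_facet_topologies(3, ordered=False))
    # Toy reference tensor for one facet pair (made-up numbers).
    tensors = {(0, 1): [[1.0, 2.0], [3.0, 4.0]]}
    print(facet_tensor(tensors, 0, 1, [10.0, 1.0]))
```

Keying the precomputed tensors by the pair of local facet numbers corresponds to the facet0 and facet1 arguments that are known only when the facet is visited during assembly.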
The FFC machinery which generates code for each facet–facet combination based on UFL expressions is largely unaffected by the extensions to discontinuous Galerkin methods. As a consequence, the quadrature representation can be extended in a similar fashion taking into account the differences between the two representations described in Section 3.2. Furthermore, the optimisations presented in Section 3.3 also apply to variational forms containing interior facet integrals.
4.1.4 Extending DOLFIN
To assemble the global sparse tensor A for variational forms that contain integrals over interior facets as in (4.1), one may extend the standard assembly algorithm over the cells of the computational mesh (see Algorithm 1, page 27) by including an iteration over the interior facets of the mesh. The approach is described for the bilinear form in (4.1) where, for ease of notation, it is assumed that $u, v \in V$. Adopting the notation from Section 1.3.5 the tensor A which arises from assembling the bilinear form in (4.1) can be expressed as: