Code optimisation for data-based global carbon cycle analysis

André Schembri

August 19, 2016

i. TABLE OF CONTENTS

i. Table of Contents
ii. Table of Figures
iii. Table of Acronyms
iv. Abstract
v. Acknowledgments
1 Introduction
2 Literature Review
   2.1 CARDAMOM
   2.2 Target System
      2.2.1 The Eddie3 Computer Cluster
      2.2.2 The Xeon® E5-2630 v3 processor
   2.3 Code Optimisation
      2.3.1 Increasing instruction level parallelism
      2.3.2 Using less expensive instructions
      2.3.3 Using optimal memory access patterns
3 Research Methods
   3.1 Identifying the best compilation strategy
   3.2 Code Optimisation
   3.3 Keeping track of changes
   3.4 Scripts used to run tests
   3.5 Verification of results
   3.6 Determining performance gain and loss
4 Compilation Strategies
   4.1 List and description of compiler flags determined to be important
   4.2 Results of different optimisation levels
   4.3 Results of different combinations of compilation strategies
5 Code Optimisation
   5.1 Hotspot Analysis
   5.2 Carbon Model Code Optimisation
      5.2.1 Calculate time and temperature dependencies
      5.2.2 Calculate Fluxes code block and the ACM function
      5.2.3 Extracting function from the main Carbon_Model loop
      5.2.4 Calculate growing session index code block
   5.3 The EDC2_GSI subroutine optimisation
   5.4 Merging the optimisations together
6 Conclusions
7 Future Work
Appendix A – Code of the carbon model sub procedure
Appendix B – Scripts used to run benchmarks
   Shell script used to run benchmarks
   Script used to collect results
Appendix C – Simple unit test
Appendix D – Other Results
   Results which were not collected with Eddie
   Results collected with Eddie
Bibliography

ii. TABLE OF FIGURES

Figure 1-1 The carbon exchanges within a terrestrial ecosystem (taken from [2])
Figure 2-1 Three steps usually undertaken to generate data with CARDAMOM
Figure 2-2 A UML activity diagram showing a high-level overview of how CARDAMOM works, together with the CPU time spent in each part of CARDAMOM
Table 2-1 Different node types that compose the Eddie3 compute cluster (taken from https://www.wiki.ed.ac.uk/display/ResearchServices/Memory+Specification)
Table 2-2 The maximum Turbo Boost that can be achieved by the different cores within the Intel® Xeon® Processor E5-2630 v3, both when using AVX instructions and other x86 instructions (taken from Intel® Xeon® Processor E5 v3 Product Family – Processor Specification Update [11])
Figure 2-3 The different execution ports and execution units in the Haswell microarchitecture (taken from [7])
Figure 3-1 Directory structure used to run CARDAMOM tests in bulk, with scripts to build the executable (Makefile), run all the tests and build files (runall.sh) and collect results (getresults.py)
Table 4-1 List of different compilation flags available for the Intel compiler and their intended effect
Figure 4-1 Time taken to run CARDAMOM compiled with the Intel compiler using different optimisation levels, using the UK forestry parameter file as input; the results show that the best optimisation level is O2
Figure 4-2 Time taken to run CARDAMOM with the UK forestry sample using different compilation strategies (for "single file" see footnote); the results show that the best performing compilation strategy is O2 -xHost -ipo -no-ftz. "Single file" means that the source was not linked from two separate compilations but compiled in a single compilation
Figure 5-1 A tree diagram showing how the CPU time was spent in each part of the application (some values below 1% are omitted)
Figure 5-2 Code block "Calculate time and temperature dependencies" from CARDAMOM's Carbon_Model method
Figure 5-3 Computing a number that, when multiplied, generates (almost) the same results as dividing by the number, and assigning it to deltatDivisor
Figure 5-4 Replacing the division (/deltat(n)) with a multiplication (*deltatDivisor)
Figure 5-5 Time taken to run CARDAMOM with (a) instances of /deltat(n) replaced with reciprocal multiplication and (b) the original code; by replacing the divisions with reciprocal multiplication, performance was measured to be ~2.63% higher
Figure 5-6 Time taken to run CARDAMOM with (a) Pools array dimensions swapped, (b) Pools and Fluxes array dimensions swapped, and (c) the original code without modifications; results show a slight increase in performance when swapping the Pools array
Figure 5-7 Declaration of the Pools_internal array, and transposing Pools and storing the result in the Pools_internal array
Figure 5-8 Time taken to run CARDAMOM with (a) no code changes, (b) the Pools array transposed into a new array, and (c) Pools and Fluxes transposed into new arrays; results show that transposing the Pools and Fluxes arrays using the transpose intrinsic results in performance degradation
Figure 5-9 Code sample showing how writes to Fluxes are replicated into Fluxes_current_time_step, which is then used for reads
Figure 5-10 Running CARDAMOM with (a) replication of Fluxes writes into Fluxes_current_time_step to improve reads and (b) the original code; results show that replicating the array writes into a well-formed array increases performance
Figure 5-11 The Calculate Fluxes code block
Figure 5-12 Average time required to run CARDAMOM with (a) ACM manually inlined in the Carbon_Model main loop and (b) the original code base; results show that by inlining ACM the performance degraded
Figure 5-13 The main part of the ACM function
Figure 5-14 The declaration of three saved variables nitmultipliedbynai, abs_deltawp_exp_hydraulicexponent and hydraulic_temp_coef_multiplied_rtot used to precompute values for the ACM function
Figure 5-15 The assignment of nitnuemultipliedbynai, abs_deltawp_exp_hydraulicexponent and hydraulic_temp_coef_multiplied_rtot
Figure 5-16 Precomputing lai*lai in order to reduce computations and increase pipeline efficiency
Figure 5-17 The parts of ACM where parts of the variable assignments were replaced with precomputed values
Figure 5-18 Time taken to run CARDAMOM with (a) the original code, (b) some computations of ACM performed outside the function, and (c) some computations of ACM performed once for the whole loop; results show that by precomputing once the performance decreases significantly, while precomputing at each MHMCMC step increases performance slightly
Figure 5-19 Declaration of the two-dimensional array acmOptimisationArray used to store precomputed values used within the ACM function; the first dimension represents one of the four parameters (trange, gc, parts of pn, and parts of mult), the second dimension represents the day
Figure 5-20 Declaration of the calculateTrange function, which computes the trange value as previously done within the ACM function
Figure 5-21 Declaration of the generateGC function, which computes gc as previously done within the ACM function
Figure 5-22 Precomputation of trange, gc and parts of the pn and mult expressions used within the ACM function; the precomputations are done once, but are calculated for each day that CARDAMOM is computing, as the met array (used for the calculation of trange and pn) can differ between days but is statically defined within the input file
Figure 5-23 The parts of the ACM function changed to use precomputed values (highlighted): trange and gc are assigned precomputed values, parts of pn have been precomputed, and lai*lai is computed earlier within the ACM function and used within the e0 assignment
Figure 5-24 Average time taken to run CARDAMOM using (a) the original source code and (b) source code modified to precompute gc, trange and parts of e0 and mult used within the ACM function; results show that these precomputations increased performance drastically
Figure 5-25 Results of running the code with (a) no changes, (b) sub procedures extracted from the main loop with directives preventing the compiler from inlining the new procedures, (c) sub procedures extracted with no-inlining enforced on the new sub procedures and inlining forced on ACM, and (d) sub procedures extracted without using any directives (either inlining or no inlining)
Figure 5-26 The Calculate growing session index code block
Figure 5-27 Changes performed to the Calculate growing session index code block
Figure 5-28 Average time taken to run CARDAMOM with (a) the division operation replaced with reciprocal multiplication within the Calculate growing session index code block and (b) the original code
Figure 5-29 Example of the branches used within the EDC2_GSI function
Figure 5-30 Pseudo code illustrating how the change was applied
Figure 5-31 Average time taken to run CARDAMOM with (a) the GSI if statement simplified, with the new branch in the same function, (b) the original code, and (c) the GSI branch simplified and extracted into a new function
Figure 5-32 Average time to run CARDAMOM using different accumulated changes compared with the original code
Figure D-1 Benchmark of different compilation strategies of the Intel Fortran compiler; the reported average is of 5 concurrent runs requesting 1,000,000 parameters for the UK forestry sample file; tests were run on a desktop PC running Linux with an Intel Haswell i7 5820K processor
Figure D-2 Benchmark of different compilation strategies of the GNU Fortran compiler; the reported average is of 5 concurrent runs requesting 1,000,000 parameters for the UK forestry sample file; tests were run on a desktop PC running Linux with an Intel Haswell i7 5820K processor
Figure D-3 Different compilation strategies based on the O3 optimisation level for the Intel Fortran compiler


iii. TABLE OF ACRONYMS

CARDAMOM: CARbon DAta MOdel fraMework

DALEC: Data assimilation linked ecosystem carbon

DAZ: Denormals are zero

ECDF: Edinburgh Compute and Data Facility

EDC: Ecological and dynamical constraints

FTZ: Flush to zero (when number becomes a denormal)

GCEL: The Global Change Ecology Lab (GCEL) at the School of GeoSciences in the University of Edinburgh.

Gfortran: GNU Fortran Compiler

GSI: Growing season index

HPC: High performance computing

Ifort: Intel Fortran Compiler

IPO: Interprocedural optimisation

MHMCMC: Metropolis-Hastings Markov chain Monte Carlo


iv. ABSTRACT

The purpose of this project was to optimise the CARbon DAta MOdel fraMework (CARDAMOM) through serial code optimisation. Hotspot analysis showed that a single sub procedure consumed more than 81% of CARDAMOM's total running time. By determining the optimal compilation strategy, the application was observed to run more than 15% faster than with the previously used compilation strategy. Furthermore, through multiple code changes the execution of CARDAMOM was made a further 5% faster.


v. ACKNOWLEDGMENTS

I would like to thank Terry Sloan and Dr. Luke Smallman for their continuous support, dedication and patience.


1 INTRODUCTION

The global carbon cycle describes the exchanges of carbon within and between different ecosystems in the biosphere. Understanding the global carbon cycle is important because of the long-term impact of CO2 in the atmosphere, and hence global climate change [1], as well as for forecasting changes in natural resources to help improve their management in the shorter term.

Different carbon pools take part in this cycle through different processes (see Figure 1-1). For instance, carbon is transferred from the atmosphere into the terrestrial biosphere through plants and trees as part of photosynthesis. Various events then determine the pool to which the carbon from plants is transferred. Some examples are:

1. When plants die, a small percentage of the captured carbon remains as fixed carbon;
2. If plants burn, most of the stored carbon is transferred to the atmosphere;
3. During respiration, plants also release back some of the carbon they previously captured [1].

When an ecosystem gains more carbon than it loses it is called a net sink; conversely, when an ecosystem loses more carbon than it gains it is called a net source [1].

Figure 1-1 The carbon exchanges within a terrestrial ecosystem (taken from [2])

The terrestrial ecosystem plays an important role in the global carbon cycle since it exhibits the largest inter-annual variability in gains and losses of carbon and has the largest uncertainty. The uncertainty is due to corresponding uncertainties in the size, physical distribution and complex dynamics of carbon in its main pools [3]. Furthermore, due to the complexity and dynamics of the terrestrial ecosystem, the balance between sinks and sources remains poorly constrained [3].

In recent years, model-data fusion (MDF) has started to be used to address these uncertainties [4]. In MDF, observations such as those from flux towers, satellites and trait databases [3] are statistically combined with current scientific hypotheses and understanding. This, in turn, reduces the effects of the poorly constrained terrestrial carbon cycle [4].

The CARbon DAta MOdel fraMework (CARDAMOM) is an analysis tool available in the Fortran and C programming languages, developed by the Global Change Ecology Lab (GCEL) at the School of GeoSciences in the University of Edinburgh. CARDAMOM applies MDF to provide analysis of the terrestrial carbon cycle. It uses a Metropolis-Hastings Markov chain Monte Carlo (MHMCMC) as its statistical method to assimilate the observed data with the current scientific models, resulting in a data-constrained carbon cycle analysis.

CARDAMOM works in a serial fashion by analysing each specific geographical location independently, using the assimilated data to compute the necessary correlations. Because each geographical location is represented by its own small MHMCMC chain, CARDAMOM is considered embarrassingly parallel. It is, however, still very computationally expensive: a global analysis requires approximately 44,000 independent tasks, each running for around 3 hours. Such computational requirements are large enough to potentially limit the amount of research that can feasibly be done with current technology. Therefore, optimising the performance of CARDAMOM will allow for better use of resources, which in turn could allow further research to be done.

This dissertation analyses the CARDAMOM Fortran code base to find the optimal way to create a more performant executable. The University of Edinburgh's HPC facility EDDIE3 and the Intel Fortran compiler are the target platform and compiler respectively.
This will be done by first determining the optimal compilation strategy for CARDAMOM (Section 4). Next, the best portions of code to optimise will be identified (Section 5.1) and modified to improve performance (Sections 5.2 - 5.3). Finally, tests to verify the performance gains will be executed (Section 5.4) and conclusions drawn (Section 6).


2 LITERATURE REVIEW

This section provides a brief review of the information that was needed to perform this project. The first part of the chapter provides important information about CARDAMOM, the application that was optimised.

The second part summarises important information about the target system, as well as specific details about the CPU it uses; this information is crucial for the creation of reliable tests.

The third part gives a quick overview of different code optimisation techniques.

2.1 CARDAMOM The Carbon Data Model framework (CARDAMOM) is a terrestrial carbon cycle analysis tool. CARDAMOM can generate predictions of the terrestrial carbon fluxes between different carbon pools, more specifically the plant and soil pools, as well as their exchanges with the atmosphere. To achieve this, CARDAMOM uses a Metropolis-Hastings Markov chain Monte Carlo (MHMCMC) search algorithm to select parameters based on an estimate of their likelihood (see Figure 2-2). Each parameter set is then used to generate a representation of the fluxes and state of the carbon cycle using the DALEC2 method. Afterwards, the computed parameters containing the resulting carbon cycle estimate are assessed with ecological and dynamical constraints (EDCs) [3], which accept or reject parameters based on whether they are consistent with current ecological understanding.

CARDAMOM is available in both C and Fortran programming language versions. These codebases are designed to use the same algorithms and produce the same results.

[Figure 2-1 shows three boxes: "Generate parameter input file" → "CARDAMOM (C or Fortran)" → "Cleaning resulting data and producing required data and charts"]

Figure 2-1 Three steps usually undertaken to generate data with CARDAMOM

CARDAMOM is a three-step process (see Figure 2-1):

1. Input files are generated containing all prior information
2. The CARDAMOM analysis retrieves a specified number of parameter vectors
3. The retrieved parameters are used to generate a group of estimates about the terrestrial carbon cycle, which are typically represented graphically.

A large-scale terrestrial carbon analysis is performed by running multiple CARDAMOM instances (as separate independent tasks), each handling a different part of the geographical area of interest (and therefore supplied with a different input file). Afterwards, the results produced by each task are cleaned and merged together.


[Figure 2-2 content: Parse input arguments → Read parameters file → Randomly sample the unknown parameters (find EDC initial values; 2.1%). Then, until the number of accepted solutions reaches the number requested: take a step in parameter space (STEP; 6.89%) → validate parameters with EDC1* → use DALEC2 to compute fluxes (CARBON_MODEL; 76.9%) → validate parameters with EDC2* (8.6%) → compute likelihood → accept/reject and update the acceptance rate → if it is time to adapt, adapt the step size. The MHMCMC as a whole accounts for 90.9% of CPU time.

*EDC validation is used only when EDC is enabled by the user]

Figure 2-2 A UML Activity Diagram showing a high level overview of how CARDAMOM works, together with the CPU time spent in each part of CARDAMOM


2.2 TARGET SYSTEM

2.2.1 The Eddie3 Computer Cluster The Global Change Ecology Lab (GCEL)1 at The University of Edinburgh uses the Edinburgh Compute and Data Facility (ECDF)2 resources to run CARDAMOM.

EDDIE is one of the services offered by ECDF. It is a compute cluster composed of 368 nodes and is currently in its third revision [5]. Throughout this document, EDDIE will be used to refer to the third revision of EDDIE.

EDDIE is composed of different types of compute nodes shown in Table 2-1.

Table 2-1 Different node types that compose the Eddie3 compute cluster (taken from https://www.wiki.ed.ac.uk/display/ResearchServices/Memory+Specification)

Node          | Specification         | Quantity               | Processor
Standard      | 16 cores, 64 GB RAM   | 190 nodes / 3040 cores | Intel® Xeon® Processor E5-2630 v3 (2.4 GHz)
Intermediate  | 16 cores, 192 GB RAM  | 63 nodes / 1008 cores  | Intel® Xeon® Processor E5-2630 v3 (2.4 GHz)
Large         | 32 cores, 2 TB RAM    | 2 nodes / 64 cores     | Intel® Xeon® Processor E7-4820 v2 (2.0 GHz)
IGMM Standard | 16 cores, 128 GB RAM  | 107 nodes / 1712 cores | Intel® Xeon® Processor E5-2630 v3 (2.4 GHz)
IGMM Large    | 16 cores, 768 GB RAM  | 6 nodes / 192 cores    | Intel® Xeon® Processor E5-2630 v3 (2.4 GHz)

A standard node is composed of two Intel® Xeon® Processor E5-2630 v3 (Haswell microarchitecture) CPUs, each with eight physical cores, amounting to a total of sixteen cores [6]. Hyper-Threading is disabled, so each CPU core runs a single thread. Each CPU core can be allocated a maximum of 8 GB of RAM; however, the total amount of RAM available on a standard node is 64 GB, which means that physically there are 4 GB of RAM per core.

2.2.2 The Xeon® E5-2630 v3 processor The Xeon® E5-2630 v3 CPU used by the standard and intermediate nodes on Eddie3 is a superscalar, multicore processor based on the Intel Haswell microarchitecture (4th generation of the Intel Core series). The Haswell microarchitecture uses several features and techniques to increase power efficiency as well as performance; these features affect both which optimisations make

1 http://www.geos.ed.ac.uk/gcel 2 ECDF is part of The University of Edinburgh, see http://www.ed.ac.uk/information-services/research-support/research-computing/ecdf for further information.

sense on this architecture as well as the fluctuations expected in the benchmarks. Some of the features of the Haswell architecture are described further in the sub-sections below.

2.2.2.1 Power and performance efficiency In order to increase power efficiency in the Haswell microarchitecture, Per Hammarlund et al [7] recount that three categories of improvements were made: low-level implementation improvements, high-level architecture improvements and platform power management.

The low level implementation improvements include improvements related to the manufacturing process, the materials used, optimisation of algorithms used in the microarchitecture as well as better design and implementation (i.e. high utilisation of gating and use of low-power modes)[7].

Per Hammarlund et al [7] stated that the high level architecture improvements include treating different elements in the processor such as the cores, caches and the system agent as different voltage-frequency domains. The power control unit (PCU) would then prioritise the power budget distribution to these domains in order to return the best performance from the processor. This means that the power budget is dynamically distributed between different parts of the processor to make best use of the power budget.

On a platform level, the Haswell microarchitecture has different C-states (CPU idle states) [7] and P-states (CPU performance states) [8]. These allow the processor to enter idle states and change performance states in order to trade off performance against power consumption. Furthermore, Per Hammarlund et al [7] argue that because Haswell uses a multiphase fully integrated voltage regulator, a 2x to 3x increase in available peak power can be achieved in Haswell processors. This can in turn be used for burst performance (such as with Intel Turbo Boost).

This means that the Haswell class of processors increases performance and efficiency both by dynamically distributing the power budget to different parts of the processor, and by changing the power supplied to the processor, decreasing power consumption when the processor is underutilised (Intel SpeedStep) and increasing performance when it is heavily utilised (Intel Turbo Boost). It is important to note that the Intel Turbo Boost Technology 2.0 featured in the E5-2630 v3 CPU enables the CPU to run at higher frequencies than the standard one (up to 3.2 GHz [9]) in order to handle large loads. However, Intel [10] emphasises that this boost is only enabled when the CPU is operating within certain specification limits, such as those related to power, current and temperature.

Furthermore, as shown in Table 2-2, the maximum frequency that can be achieved differs according to the core being used.


Core                                                                1    2    3    4    5    6    7    8
Intel® Turbo Boost Technology maximum core frequency (GHz)          3.2  3.2  3.0  2.9  2.8  2.7  2.6  2.6
Intel® AVX Turbo Boost Technology maximum core frequency (GHz)3     3.2  3.2  3.0  2.9  2.8  2.7  2.6  2.6

Table 2-2 showing the maximum Turbo Boost that can be achieved by the different cores within the Intel® Xeon® Processor E5-2630 v3, both when using AVX instructions and other x86 instructions. (Taken from Intel® Xeon® Processor E5 v3 Product Family - Processor Specification Update [11]).

Therefore, it can be concluded that the features used for power and performance management in the Haswell architecture can cause fluctuations in the time taken to run the same executable on the same processor, and can therefore compromise the results of the benchmarks produced. To reduce the effect of such fluctuations, the benchmarks can be produced by utilising all the cores in the CPU (i.e. running 16 CARDAMOM instances concurrently) and by running the tests multiple times.

2.2.2.2 Out of Order Execution Engine CPUs have different functional units to process integers, floating-point numbers, logic operations, memory addresses, etc. In order to use these functional units concurrently, Haswell processors provide multiple ports, each containing a set of functional units (see Figure 2-3).

These ports are fed with micro-operations. Micro-operations are generated by decoding x86 instructions into compound/fused micro-operations; these are then split into simpler micro-operations which can be sent to different execution ports to be executed by different units, or stored in a cache [7].

The Haswell microarchitecture has 8 execution ports (shown in Figure 2-3), each mapping to a different set of functional units. The ports are fed from the front end pipeline, which uses two main sources [7]:

- A cache/decoder pipeline which decodes 16 bytes of instructions and supplies up to four compound micro-operations
- A micro-operations cache which stores decoded micro-operations and can supply them at a rate of 32 bytes per cycle

Since each execution port can only execute one micro-operation per cycle, micro-operations should be ordered so that as many ports as possible are used. Therefore, by writing and compiling code which makes better use of the out of order execution engine, one can get better overall performance from the application.

However, this poses some challenges, chiefly that some operations depend on each other, forming a dependency chain. Furthermore, loops and if statements (especially longer branches) can result in long dependency chains that prevent the processor from making full use of the out of order execution engine until they are resolved.

3 In the original document this is marked as MHz but it is assumed that this is a mistake and should have been marked as GHz in the original document.

In order to minimise the effect of branches, processors perform branch prediction: they try to predict the outcome of a branch before it is resolved, so that subsequent code can be executed pre-emptively and its effects accepted or rejected once the actual branch outcome is known.

Figure 2-3 The different execution ports and execution units in Haswell microarchitecture taken from [7].

The cost of branches can be reduced both by the compiler (for example through simplification of expressions) and by manual code changes: helping the processor predict branches more reliably, increasing the chances of loop unrolling, and reducing congestion on a specific port (i.e. manually distributing operations that are frequent in specific parts of the code), as well as by simplifying expressions by hand.

2.2.2.3 SIMD support Current generation CPUs such as the Xeon® E5-2630 support single instruction, multiple data (SIMD) instructions through technologies such as MMX, AVX and AVX2 [9]. These instructions enable the processor to execute one instruction on multiple data elements (e.g. variables). In the case of AVX2, the CPU supports 256-bit operations, meaning up to 4 double precision or up to 8 single precision operations per cycle, which can be FMA instructions. In Haswell it is reported that by using AVX2 one can achieve a throughput of 1 addition per cycle, 2 multiplications per cycle and 0.04 division operations per cycle [12].


In order for the compiler to generate AVX2 instructions, the code should be compiled with vectorisation enabled (automatic at optimisation level O1 or higher) and should target the AVX2 platform [13]. Furthermore, the Fortran code should have certain attributes, such as being part of a loop, contiguous memory access, no write-after-read dependencies and no flow dependencies (i.e. data dependencies between iterations) [13].

Furthermore, Wittmann et al [12] report that the achieved performance of AVX can differ greatly from the reported peak. This is because with compiled code the processor has to fetch the named variables from registers, whilst the peak numbers usually reported assume handwritten assembly code.

2.3 CODE OPTIMISATION When a serial code base already uses efficient algorithms, changes to increase performance usually involve modifying the code to use the CPU more efficiently. Examples of such changes include: reducing CPU stalls, increasing instruction level parallelism, making use of less expensive instructions, using better memory access patterns, reducing the number of jumps within the instruction stream and making better use of registers.

Compilers such as the Intel Fortran compiler can be configured to generate more efficient executables by performing transformations during compilation of the source code [14]. However, unless profile-guided optimisation [15] is used, the compiler is unaware of how the code is actually run and is therefore limited to static code analysis. Furthermore, compilers are software applications which target a multitude of different scenarios, cannot make educated assumptions about the code being compiled and therefore might not recognise every opportunity for optimisation.

Therefore, apart from determining how to best configure the compiler to generate efficient code, it is also beneficial to change the code to help the compiler produce better resultant executables. Such changes are explained in the following subsections.

2.3.1 Increasing instruction level parallelism As described in section 2.2.2.2, modern processors can execute instructions out of order. However, an instruction that depends on the result of another instruction cannot execute until that result has been computed, forming a dependency chain [16]. Although the out of order execution engine can schedule different dependency chains, long dependency chains limit its ability to process instructions in parallel [16][14].

Therefore, it is beneficial to reduce the dependency chains within the application. This can sometimes be done by enclosing parts of an expression in parentheses [16] and by making branches shorter.

2.3.2 Using less expensive instructions Different instructions exhibit different latencies and execution times [17]. Some instructions, such as division, can in certain cases be replaced with less expensive ones, for example a reciprocal multiplication in the case of floating-point operations [16].


2.3.3 Using optimal memory access patterns In order to make best use of processor features such as caches and prefetchers, code should be written to take advantage of locality of reference [16][18]: data accesses should preferably exhibit temporal, spatial [18] or equidistant locality. Temporal locality refers to repeated access to the same data; spatial locality refers to accesses near a previous access [18]; equidistant locality refers to accesses at a fixed stride (i.e. always moving by the same number of elements).


3 RESEARCH METHODS

The objective of this project is to make the CARDAMOM code base run more efficiently through code optimisation, while keeping or improving the correctness of the simulation.

When configured to do so, compilers can do some optimisation without any code changes to the CARDAMOM code base. Therefore, it is ideal to identify the optimal compilation strategy to be used before doing any manual code changes. This would significantly reduce unnecessary code changes to parts of the code that would otherwise be already optimised by the compiler.

After the best overall compilation strategy is identified, the parts of the code that take the highest percentage of the running time (i.e. hotspots) can be determined. These will then be investigated for manual code optimisation in order to gain further performance.

3.1 IDENTIFYING THE BEST COMPILATION STRATEGY

The main objective of compiling code is to generate assembly code from code written in a high-level programming language. However, good compilers also aim to generate assembly code that performs well.

Compilers are able to produce better performing code by using multiple strategies, which in the case of the GNU Fortran compiler and the Intel Fortran compiler are configurable (typically through command line options). Different strategies can change the total runtime of code by, for example, reducing the number of stalls caused by cache misses, making better use of the pipeline by reordering statements, reducing the number of jumps within the assembly, or reducing the size of the code so that it fits better in the instruction cache.

However, depending on the nature of the code base being compiled and the system being targeted, some compilation strategies can work better than others. Furthermore, changes to the code might yield a performance benefit under one compiler transformation while having the opposite effect under another.

Therefore, it is ideal to determine the best compilation strategy before starting hotspot analysis and optimisation of the code base.

The compilation strategy will be determined by first short-listing a number of flags worth investigating, then defining a set of compilation strategies (i.e. combinations of these flags). The application will then be built with each of these compilation strategies, and run and timed 64 times per strategy. The average of these runs will form the benchmark used to select the compilation strategy.

It should be noted that more weight will be given to the Intel Compiler results than to the GNU Fortran ones, because CARDAMOM is usually built with the Intel Compiler.


3.2 CODE OPTIMISATION

Compilers can do a lot to optimise the resulting assembly code so that the application runs faster. However, there are some aspects of optimisation that the compiler cannot perform. Although compilers are excellent at detecting dependencies in the code, (depending on the optimisation flags used) they cannot perform unsafe transformations that could break the code. Furthermore, compilers are usually unaware of how the code is run (although exceptions such as profile-guided optimisation exist). Therefore, manual code optimisation can allow for further efficiency gains.

In order for code changes to be most effective, they should usually target the parts of the code identified as using the most resources (hotspots) during execution. Therefore, in order to identify the hotspots of CARDAMOM, hotspot analysis will be performed using Intel Vtune Amplifier as a profiler; the main metric used initially will be CPU time.

Once the hotspots are identified, each hotspot will be analysed for possible improvements. Each of the identified improvements will be implemented, run, and compared with the running time of the original code base, as well as verified to produce correct results.

3.3 KEEPING TRACK OF CHANGES

During code optimisation, many different code changes were undertaken. In order to keep track of each of these changes, Subversion (SVN) was used as the version control system.

Each code optimisation was committed to its own SVN branch. Finally, to merge the different changes, they were compared and merged using Eclipse utilities and committed to a separate branch.

3.4 SCRIPTS USED TO RUN TESTS

In order to reduce the amount of time spent on manual compilation, on timing the executables, and on collecting the recorded times, a simple set of scripts was created to automate these processes when running the tests on Eddie.

The scripts were designed to use the specific directory structure shown in Figure 3-1. When the script RunAll.sh (code in Appendix B, Shell script used to run benchmarks) was invoked, all the subdirectories of the sibling directories were cleaned and built using the Makefile; the executable under test was then run four consecutive times in sixteen concurrent runs (therefore each test was run sixty-four times), and each run was timed using the time command4. The time taken, as well as the messages output by each CARDAMOM run, were written to a directory named outs, in two separate files per run.

Once all the executions were completed, GetResults.py (code in Appendix B, Script used to collect results) was invoked to collect the time taken for each execution from the outs directory of each test; the mean of these results was computed and printed to the console, and a csv file containing the test folder names together with the results of each run was produced.

4 http://linux.die.net/man/1/time

The csv file generated was then loaded into a spreadsheet, and the average of these runs together with the standard deviation were computed.


Figure 3-1 Directory structure used to run CARDAMOM tests in bulk, with scripts to build the executable (MakeFile), run all the tests and build files (RunAll.sh) and collect results (GetResults.py)

3.5 VERIFICATION OF RESULTS

Changes to the code base should continue to give correct results. In order to check that the code was producing the expected results, three different checks were performed.

The first check used the statistical methods in the R scripts in CARDAMOM to validate that the generated results are correct. The second check, used only for the Carbon_Model procedure, was a simple unit test (see Appendix C – Simple unit test) which checks that for given input parameters the procedure produces the correct output parameters within a certain tolerance.

Another check was to verify the equivalence of the results generated by the original and the modified executables. As CARDAMOM uses an MHMCMC, which depends on the randomness of the application, for these tests CARDAMOM was modified to use a fixed seed so that the two executables produce equivalent results. However, some changes break this equivalence while still being correct.

Unless otherwise stated, all the results documented have been tested to pass the verification tests.

3.6 DETERMINING PERFORMANCE GAIN AND LOSS

Modern microprocessors feature several technologies that help achieve higher efficiency. These include (but are not limited to) multiple cores per processor, performance bursts (the CPU can increase or decrease its clock multiplier depending on load and state in order to reduce power consumption or increase performance), and the use of multiple execution units per cycle through pipelining.

These technologies greatly help in achieving higher efficiency, but they can make it very difficult to run the same application in a consistent time span. Variations can be caused by sharing of caches and main memory, by the physical state and load of the CPU, which can limit the maximum core frequency, and by which core the application is run on (see section 2.2.2.1), which can also limit the highest achievable clock frequency.

Apart from the fluctuations produced by the hardware, CARDAMOM's execution time also fluctuates due to the use of a random number generator in its MHMCMC algorithm. These fluctuations can be reduced by always using the same seed; however, this approach could hide variations that appear only when a random seed is used.

Therefore, the approach taken is to generate a "big enough" sample of 64 runs per test, obtained by using a whole standard Eddie node four times (4 batches of 16 concurrent runs). The mean of these runs is then calculated together with the standard deviation, which is used to produce the error bars. Two sets of tests will be collected for each result: one using a fixed seed (seed used: -12.3) and the other using a random seed; all tests used the UK forestry sample.

4 COMPILATION STRATEGIES

CARDAMOM is usually compiled with the Intel Compiler using the default -O2 optimisation level without any further optimisation options. In order to avoid manual optimisation of code that the compiler can already optimise, the ideal compilation strategy for CARDAMOM was determined first.

Different compilation flags were investigated and shortlisted. Afterwards, each optimisation level (-O0, -O1, -O2, -O3 and -Ofast) was benchmarked to determine which performs best with CARDAMOM. The best optimisation level was then further fine-tuned by analysing additional configuration and transformation flags.

All the benchmarks were performed using the UK Forestry parameters file as the sample data and requesting 1 million parameters. Each test was run 64 times, filling a whole Eddie standard node four times.

Since the main objective of this project is to optimise CARDAMOM for the Intel Fortran compiler, much more effort was put into finding the best compilation strategy for the Intel Fortran compiler than for the GNU Fortran compiler. However, some of the GNU Fortran compiler results are reported in Appendix D – Other Results.

4.1 LIST AND DESCRIPTION OF COMPILER FLAGS DETERMINED TO BE IMPORTANT

The compiler flags described below can be divided between those which group a large set of optimisations (usually forming the different optimisation levels) and generally include optimisations that work well together, and those which apply specific optimisations or configurations.

The following descriptions are mainly taken from the Intel documentation [19].

-O0: Instructs the compiler not to perform any optimisations.

-O1: Optimises for size; transformations which make the resulting assembly code larger (such as inlining) are not included.

-O2: Instructs the compiler to perform transformations and use options that increase speed. Such optimisations include intra-file optimisations, inlining (including intrinsic subprograms), auto-vectorisation, enabling flush-to-zero, control speculation, and further optimisations [20].

-O3: Performs all the optimisations of -O2 but enables more aggressive transformations related to loops and branches [20]. Intel recommends this level when the application performs many floating-point calculations and processes large data sets [20].

-Ofast: Enables all the optimisations of -O3, plus -no-prec-div [21], which decreases the precision of floating-point divides, and -fp-model fast=2 [22], which enables aggressive optimisations on floating-point data [22].

-ipo: Enables interprocedural optimisation; the compiler can optimise by taking the program as a whole (all the source files together) into consideration, in order to perform more inlining [23].

-no-ftz: Places an instruction in the resulting assembly code so that the DAZ and FTZ hardware flags are set not to flush denormals to zero [24].

-xhost: Instructs the compiler to use all the instructions supported by the host (compiling) processor. This lets the compiler generate only the code needed to run the application on the same class of processors[25], and also allows it to use other information about the processor[26]. However, since the generated assembly targets a specific processor, the instructions generated can be unsupported on other processors (especially older ones, and ones not produced by Intel)[25].

Table 4-1 List of different compilation flags available for the Intel compiler and their intended effect

4.2 RESULTS OF DIFFERENT OPTIMISATION LEVELS

The first set of benchmarks was undertaken to determine the best available optimisation level. The results shown in Figure 4-1 show that the optimisation level that worked best was -O2 and that the worst was -O3, which performed worse than compiling the application with no optimisation (-O0). Furthermore, the -O3 and -Ofast optimisation levels gave incorrect results.

The -O2 optimisation strategy is designed to optimise for speed; the Intel documentation [20] suggests that such optimisations can include transformations such as:

 Transformations performed by taking into consideration all the code blocks within the same file (intra-file optimisations [20])
 Removing code which is never reached [20]
 Eliminating variables which are assigned but never used (dead store elimination [20])
 Reducing function calls by inlining intrinsic functions [20]
 Simplification of expressions through techniques such as copy propagation [20]
 Various techniques to utilise the CPU registers more efficiently by reducing the number of stores required [20]
 Enabling vectorisation [20], by default using SSE2 instructions


Optimisation level:   O0        O1       O2       O3        Ofast
Single Seed (s):      1225.93   644.76   528.59   2293.46   530.59
Random Seed (s):      1182.22   563.23   448.22   1174.70   645.20

Figure 4-1 Time taken to run CARDAMOM compiled with the Intel compiler using different optimisation levels, using the UK forestry parameter file as input; the results show that the best optimisation level is -O2.

4.3 RESULTS OF DIFFERENT COMBINATIONS OF COMPILATION STRATEGIES

Once the best optimisation level was found, different options were used to tune the compilation of CARDAMOM further. The results shown in Figure 4-2 show that, of the analysed options, the best performing strategy is achieved by specifying the combination of -O2 -xhost -ipo -no-ftz options.

The interprocedural optimisation (-ipo) option instructs the compiler to optimise across all the files included in the compilation. Using -ipo together with -xhost alone gave a performance benefit of ~14% (based on the single-seed runs) compared with compiling CARDAMOM with -O2 alone.

The -no-ftz option instructs the compiler to switch off the DAZ (Denormals Are Zero) and FTZ (Flush To Zero) hardware flags for the whole execution [24]. It should be noted that the DAZ and FTZ flags are enabled by default in executables generated by the Intel Fortran compiler at every optimisation level other than -O0 [24]. The DAZ and FTZ flags are usually used to reduce the cost of underflow values which cannot be represented as normalised floating-point values [12][24]. For SSE and AVX instructions the hardware can handle denormals according to the IEEE 754 standard [27], but at a higher cost [24]; Markus et al. [12] state that the cost of computing denormals can be twice as high as when using DAZ and FTZ. However, by setting denormals to zero the results are less accurate. Therefore, it is thought that the ~10% difference (based on the single-seed runs) between running the application with -O2 -xhost -ipo -no-ftz and with -O2 -xhost -ipo is achieved by the increased accuracy, which in turn results in fewer steps of the MHMCMC chain failing the likelihood functions.

Compilation strategy:                    Single Seed (s)   Random Seed (s)
-O2 -xhost -ipo -no-ftz -no-prec-div     402.95            380.92
-O2 -xhost -ipo -no-ftz Single File      403.14            381.20
-O2 -xhost -ipo -no-ftz                  449.51            412.68
-O2 -xhost -ipo                          478.04            418.59
-O2 -xhost -ipo Single File              476.87            428.84

Figure 4-2 Time taken to run CARDAMOM with the UK Forestry sample using different compilation strategies; the results show that the best performing compilation strategy is -O2 -xhost -ipo -no-ftz. Single File means that the code was not linked from two separate compilations but compiled in a single compilation.

5 CODE OPTIMISATION

Although the compiler can perform many transformations to increase performance, manual code changes can improve performance further by helping the compiler produce better performing code. Therefore, this chapter describes how manual code optimisation of CARDAMOM was undertaken. First, a hotspot analysis (section 5.1) of the application was performed to determine where most of the time was being spent; it was observed that a specific loop in CARDAMOM consumes more than 92.1% of the total execution time.

In the following sections, changes are proposed and tested, and the test results analysed (sections 5.2 to 5.4). After manual code optimisation, an overall improvement of around 5-6% (depending on the benchmark used) was observed.

5.1 HOTSPOT ANALYSIS

In order to determine how the application's time is distributed between its different sections, hotspot analysis was undertaken.

The hotspot analysis was undertaken by first compiling the code using the Intel Fortran compiler with the "-O2 -no-ftz -ipo" flags determined earlier (see section 4.3) to form the optimal compilation for CARDAMOM. Next, the results were collected by starting CARDAMOM with Intel's Vtune Amplifier[28] attached as a profiler and configured to use "Basic Hotspots analysis". Since the results showed that more than 90% of the total time was being spent in the Carbon_Model main loop, that loop was subdivided into seven sub procedures in order to produce a more detailed breakdown of the hotspot, and the hotspot analysis was re-run using the Vtune Amplifier "Basic Hotspots analysis".

The Vtune Amplifier "Basic Hotspots analysis" uses the OS timer to interrupt the application at regular intervals and collect samples[28]. Statistical methods are then used by Vtune Amplifier to generate the results [28]. The advantage of sampling the application at regular intervals is that the overhead on the application is low; the effect on the application's performance and behaviour should therefore be minimal.

Figure 5-1 uses a tree diagram to give a more detailed breakdown of how the time is spent within the application. In Figure 2-2, the same results are shown in the activity diagram.

The results gathered show that the part of the application using the most CPU time is the Carbon Model procedure, which on its own consumes 81.4% (77.7% via Model Likelihood and 2.1% via MHMCMC) of the total running time. The remaining time is mainly spent between validating results through the EDC2_GSI function (9.1%), the calculations performed to take a step in the parameter space of the MHMCMC chain (6.5%) and the random sampling of parameters that are not defined (4.3%). Therefore, the manual code optimisation will be focused on these parts of CARDAMOM.

Figure 5-1 A tree diagram showing how the CPU time was spent in each part of the application (some values below 1% are omitted). Within the Carbon Model loop, the largest blocks are "Calculate time and temperature dependencies" (39.5%) and "Calculate Fluxes" (30.9%), with most of the latter in the ACM function (28.0%), dominated by exponent (pow()) calls.

5.2 CARBON MODEL CODE OPTIMISATION

As identified in the previous section, the Carbon Model procedure is the major hotspot in CARDAMOM, consuming 81.4% (77.7% via Model Likelihood and 2.1% via MHMCMC) of the total CPU time in the application. In this section, various changes are tested.

The Carbon Model procedure mainly consists of a do loop which steps the analysis through time. For the UK forestry file used in this project, the number of time steps is 120 and the number of retrieved parameter vectors was 1 million; this loop was therefore iterated at least 120 million times. Since this loop is quite large, throughout this document it is divided into named regions corresponding to those used in the hotspot analysis, in order to make it clearer which parts of the loop are being discussed. However, unless otherwise stated, the compiled code does not extract these regions into functions.

The main regions of the Carbon Model that will be discussed are the Calculate Fluxes region and the Calculate time and temperature dependencies region. It should be noted that the loop was analysed for further vectorisation, but no opportunity for vectorisation was found.

5.2.1 Calculate time and temperature dependencies

The Calculate time and temperature dependencies code block, shown in Figure 5-2, was determined to be a major hotspot, consuming more than 35.5% of the total execution time. It appears at the beginning of the Carbon Model loop.

!
! those with time dependancies
!

! total labile release
FLUXES(n,8) = POOLS(n,1)*(1.-(1.-FLUXES(n,16))**deltat(n))/deltat(n)
! total leaf litter production
FLUXES(n,10) = POOLS(n,2)*(1.-(1.-FLUXES(n,9))**deltat(n))/deltat(n)
! total wood production
FLUXES(n,11) = POOLS(n,4)*(1.-(1.-pars(6))**deltat(n))/deltat(n)
! total root litter production
FLUXES(n,12) = POOLS(n,3)*(1.-(1.-pars(7))**deltat(n))/deltat(n)

!
! those with temperature AND time dependancies
!

! respiration heterotrophic litter
FLUXES(n,13) = POOLS(n,5)*(1.-(1.-FLUXES(n,2)*pars(8))**deltat(n))/deltat(n)
! respiration heterotrophic som
FLUXES(n,14) = POOLS(n,6)*(1.-(1.-FLUXES(n,2)*pars(9))**deltat(n))/deltat(n)
! litter to som
FLUXES(n,15) = POOLS(n,5)*(1.-(1.-pars(1)*FLUXES(n,2))**deltat(n))/deltat(n)

Figure 5-2 Code block: "Calculate Time and Temperature dependencies" from CARDAMOM’s Carbon model method

When analysed further, it was determined that the code block has two major attributes that cost much of the overall performance and that could be improved.

The first major issue was that in this part of the code, the arrays were being accessed along the slow-moving dimension, i.e. FLUXES(n,8), where the first index, n, is the timestep/day and the second index is the parameter. Since the loop iterates over the timestep while the parameters are accessed throughout the loop body, the arrays were being accessed inefficiently.

The second major issue identified was the large number of division operations being performed.

In the following sections, the experiments performed to reduce the effect of these two attributes will be presented.

5.2.1.1 Reducing division operations

Divisions are used extensively in the Calculate time and temperature dependencies code block (see Figure 5-2). Division operations are known to be slow because of their high latency and low reciprocal throughput [17].

In order to reduce the number of division operations, one approach is to replace them with multiplications. This can be achieved by multiplying by the reciprocal, as shown in Figure 5-3: 1 is first divided by the divisor and the result stored in a variable; then each occurrence of division by deltat(n) within the loop is replaced by a multiplication with the reciprocal, as shown in Figure 5-4.

deltatDivisor = 1 / deltat(n)

Figure 5-3 Computing the reciprocal of deltat(n), which when multiplied gives (almost) the same result as dividing, and assigning it to deltatDivisor

!
! those with time dependencies
!

! total labile release
FLUXES(n,8) = POOLS(n,1)*(1.-(1.-FLUXES(n,16))**deltat(n))*deltatDivisor
! total leaf litter production
FLUXES(n,10) = POOLS(n,2)*(1.-(1.-FLUXES(n,9))**deltat(n))*deltatDivisor
! total wood production
FLUXES(n,11) = POOLS(n,4)*(1.-(1.-pars(6))**deltat(n))*deltatDivisor
! total root litter production
FLUXES(n,12) = POOLS(n,3)*(1.-(1.-pars(7))**deltat(n))*deltatDivisor

!
! those with temperature AND time dependencies
!

! respiration heterotrophic litter
FLUXES(n,13) = POOLS(n,5)*(1.-(1.-FLUXES(n,2)*pars(8))**deltat(n))*deltatDivisor
! respiration heterotrophic som
FLUXES(n,14) = POOLS(n,6)*(1.-(1.-FLUXES(n,2)*pars(9))**deltat(n))*deltatDivisor
! litter to som
FLUXES(n,15) = POOLS(n,5)*(1.-(1.-pars(1)*FLUXES(n,2))**deltat(n))*deltatDivisor

Figure 5-4 Replacing the divisor (/deltat(n)) with a multiplier (*deltatDivisor)

The deltat(n) variable was also identified as being used in other areas of the Carbon Model loop, including the calculate gradient code block and other parts of the loop that are not executed often during a CARDAMOM run.

In order to test this hypothesis, all occurrences of division by deltat(n) were replaced by the reciprocal multiplication described above. The code was then compiled and run on Eddie in 16 concurrent runs, four times.

Code run:         Replacing division by reciprocal multiplication   Original code
Single Seed (s):  391.42                                            402.95
Random Seed (s):  376.56                                            380.92

Figure 5-5 Time taken to run CARDAMOM with (a) instances of /deltat(n) replaced with reciprocal multiplication and (b) the original code. Replacing the divisions with reciprocal multiplications improved performance by ~2.63%.

As shown in Figure 5-5, changing the divisions to reciprocal multiplications achieved a performance gain of ~2.63% compared with the single-seed runs of the original code. It should be noted that the reciprocal is reused many times in the same loop; the cost of dividing 1 by deltat(n) is therefore recovered across all the computations performed.

5.2.1.2 Array Access Pattern

Arrays in Fortran are column-major, meaning that the fast-moving index should be the first one (e.g. array(fastMovingIndex, slowMovingIndex)). The array declaration reflects the layout of the data in memory. Therefore, in order to make the best use of the CPU caches and hardware prefetchers, it is beneficial that array accesses make the best use of spatial locality, or at least be predictable. Optimal use of caching and prefetching will usually reduce the number of CPU stalls spent waiting for data to arrive from main memory or higher levels of cache (L2, L3, etc.), and will therefore increase the performance of the application.

The FLUXES and POOLS arrays within the Carbon Model procedure were accessed along the slowest-moving index (the 2nd dimension). However, in other parts of the application, especially within the Model_Likelihood sub procedure, the timestep index (the first dimension) was the fastest moving.

In order to rearrange the memory locations of the POOLS and FLUXES arrays to be optimal for the Carbon Model procedure, different hypotheses were implemented and tested.

In the first set of tests, the 1st and 2nd dimensions of the POOLS and FLUXES arrays were swapped throughout the whole CARDAMOM code base, in order to put the fastest moving index on the first dimension.

Initially, the dimensions of both the FLUXES and POOLS arrays were swapped and tested. Afterwards, the FLUXES array was reverted to its original shape while the POOLS array dimensions were kept swapped.

The results in Figure 5-6 indicate that by changing only the POOLS array there was a minor improvement (based on the single seed) in performance compared with the original code. However, when both FLUXES and POOLS were swapped, performance degraded slightly. It is therefore concluded that swapping the dimensions of FLUXES slightly degrades the overall performance of CARDAMOM, because the negative effect of the inverted access pattern within the Model_Likelihood sub procedure outweighs the performance gain achieved within the CARBON_MODEL sub procedure. Furthermore, swapping the POOLS array resulted in performance degradation and higher variability in the random-seed timings.


Code run:         Swapping POOLS array   Swapping POOLS and FLUXES arrays   Original code
Single Seed (s):  401.06                 403.84                             402.95
Random Seed (s):  395.21                 387.27                             380.92

Figure 5-6 Time taken to run CARDAMOM with (a) POOLS array dimensions swapped, (b) POOLS and FLUXES array dimensions swapped and (c) the original code without modifications; the results show a slight increase in performance (single seed) when swapping only the POOLS array.

In the second set of tests, the Fortran TRANSPOSE intrinsic function was used. The TRANSPOSE function takes a two-dimensional array and returns a transposed version of it. It was hypothesised that by transposing the FLUXES and POOLS arrays, data locality within the Carbon_Model procedure could be increased. However, it should be noted that the transposition and storage of the new arrays introduce additional expensive operations into the CARBON_MODEL sub procedure.

In order to test this hypothesis, additional two-dimensional arrays with inverted shape were declared. For the first test, one additional array was declared to store the transposed POOLS array; for the second test, another array was declared for the transposed FLUXES array. The POOLS array was transposed into the POOLS_INTERNAL array before the loop within the CARBON_MODEL sub procedure (see Figure 5-7), and each expression using the POOLS array was changed to use POOLS_INTERNAL. At the end of the procedure, the new array was transposed back into POOLS so that the Carbon_Model sub procedure outputs the expected values. The solution was built and the first test was run; the first test was then augmented by applying the same procedure to the FLUXES array, and the solution was tested again.

!declare variable to hold POOLS transposed
double precision, dimension(nopools,(nodays+1)) :: POOLS_INTERNAL

POOLS_INTERNAL = Transpose(POOLS)

Figure 5-7 Declaration of the POOLS_INTERNAL array, and transposition of the POOLS array into POOLS_INTERNAL

The results of the second test shown in Figure 5-8 show that the change greatly decreased the performance. It is therefore concluded that the overhead required to transpose the arrays was greater than the benefit achieved in rearranging the access pattern.

Code run         Original code  Transpose POOLS  Transpose POOLS and FLUXES
Single seed (s)  402.95         642.82           423.83
Random seed (s)  380.92         391.94           412.98

Figure 5-8 Time taken to run CARDAMOM with (a) no code changes, (b) the POOLS array transposed into a new array, and (c) both POOLS and FLUXES transposed into new arrays. Results show that transposing the POOLS and FLUXES arrays with the TRANSPOSE intrinsic degrades performance.
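The trade-off behind this result can be sketched outside Fortran. The following Python fragment is an illustrative sketch only (the array sizes are hypothetical, and NumPy stands in for the Fortran arrays): an explicit transposed copy makes one axis contiguous for later reads, but the copy itself must touch every element, and a second copy is needed to write the results back.

```python
import numpy as np

# Hypothetical stand-in for POOLS(nodays+1, nopools); sizes are made up.
nodays, nopools = 1000, 7
pools = np.random.rand(nodays + 1, nopools)

# Analogous to TRANSPOSE into POOLS_INTERNAL: an explicit copy, after
# which the values belonging to one pool are contiguous in memory.
pools_internal = np.ascontiguousarray(pools.T)

# At the end of the procedure the data must be transposed back into the
# original layout, paying the copy cost a second time.
pools_back = np.ascontiguousarray(pools_internal.T)

assert pools_internal.shape == (nopools, nodays + 1)
assert np.array_equal(pools_back, pools)
```

Whether the improved locality pays for the two full copies depends on how many strided accesses the loop body performs, which is exactly what the benchmark above measures.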

In the third set of tests, a temporary one dimensional array which replicates the FLUXES array was introduced. The array was populated by replicating each write made to the original FLUXES array within the loop (see Figure 5-9). Then, any read from the FLUXES array which occurred after the value was written to the new array was changed to use the temporary array (see Figure 5-9). This was done with the intention of increasing spatial locality of reads by replicating the array, while taking advantage of the write-back cache of the Haswell processor to reduce the drawback of replicating the array.

! Declaration of new array…
double precision :: fluxes_current_time_step(nofluxes)

! Replicate writes
…
FLUXES(n,2) = exp(pars(10)*0.5*(met(3,n)+met(2,n)))
fluxes_current_time_step(2) = FLUXES(n,2)

…

! use fluxes_current_time_step for reads once a value has already been
! stored earlier in the loop
fluxes_current_time_step(13) = POOLS(n,5)*(1.-(1.-fluxes_current_time_step(2)*pars(8))**deltat(n))/deltat(n)
FLUXES(n,13) = fluxes_current_time_step(13)

Figure 5-9 Code sample showing how writes to FLUXES are replicated into fluxes_current_time_step, which is then used for subsequent reads

The results shown in Figure 5-10 show that replicating the writes into a separate array used for subsequent reads improved the performance of the application slightly (~1.2%). This change is preferred over swapping the array dimensions because it gives a performance gain on both benchmarks and does not affect other subroutines within CARDAMOM that may be reused in other applications.

Code run         Replicating writes of FLUXES  Original code
Single seed (s)  398.05                        402.95
Random seed (s)  376.76                        380.92

Figure 5-10 Running CARDAMOM with (a) replication of FLUXES writes into fluxes_current_time_step to improve reads and (b) the original code. Results show that replicating the array writes into a compact, contiguous array increases performance.

5.2.2 Calculate Fluxes code block and the ACM function

The Calculate Fluxes code block (Figure 5-11) and its callees consume 30.9% of the total running time of CARDAMOM. Most of this time is spent in the ACM function (see pg. 69 in Appendix A – Code of the carbon model), which takes 29% of the total running time of CARDAMOM (Figure 5-1).

! GPP (gC.m-2.day-1)
if (lai(n) > 0.) then
   FLUXES(n,1) = acm(gpppars,constants)
else
   FLUXES(n,1) = 0.
end if
! temprate (i.e. temperature modified rate of metabolic activity)
FLUXES(n,2) = exp(pars(10)*0.5*(met(3,n)+met(2,n)))
! autotrophic respiration (gC.m-2.day-1)
FLUXES(n,3) = pars(2)*FLUXES(n,1)
! leaf production rate (gC.m-2.day-1)
FLUXES(n,4) = 0. !(FLUXES(n,1)-FLUXES(n,3))*pars(3)
! labile production (gC.m-2.day-1)
FLUXES(n,5) = (FLUXES(n,1)-FLUXES(n,3)-FLUXES(n,4))*pars(13)
! root production (gC.m-2.day-1)
FLUXES(n,6) = (FLUXES(n,1)-FLUXES(n,3)-FLUXES(n,4)-FLUXES(n,5))*pars(4)
! wood production
FLUXES(n,7) = FLUXES(n,1)-FLUXES(n,3)-FLUXES(n,4)-FLUXES(n,5)-FLUXES(n,6)

Figure 5-11 The Calculate Fluxes Code Block

The ACM function was profiled using Intel's VTune Amplifier. The profiler was configured to use hardware counters to collect data such as the utilisation of each backend port, the number of branch mispredictions and the number of machine clears. From the profiling session, it was determined that there was very high (~100%) utilisation of the divider unit and of the third backend port, which contains a load and store address unit (refer to section 2.2.2.2).

Through analysis of the code, it was determined that many of the assigned variables required only a few operations to compute, which could contribute to such high usage of the load and store unit. The high utilisation of the divider was attributed to the use of the square root function and the division operations. It was also identified that some of the values could be precomputed to reduce the number of times they need to be calculated. Therefore, most of the optimisations undertaken in the following subsections aim to reduce the number of loads in the function, mainly by precomputing some values.

5.2.2.1 Inlining the ACM function

To see the effects of manually inlining the ACM function, it was inlined in the Carbon_Model sub procedure loop. As shown in Figure 5-12, the results are not satisfactory. This is presumed to be because the compiler performs better transformations through its interprocedural optimisations, together with other transformations it applies after inlining.


Code run         ACM inlined in the main loop  Original code
Single seed (s)  469.83                        402.95
Random seed (s)  427.93                        380.92

Figure 5-12 Average time required to run CARDAMOM with (a) ACM manually inlined in the Carbon_Model main loop and (b) the original code base. Results show that manually inlining ACM degraded performance.

5.2.2.2 Analysis of ACM dependencies

In order to examine which parts of the ACM function can be precomputed, each dependency used in the assignment of variables within the ACM function was examined; the findings are described below.

Figure 5-13 shows the part of the ACM function that contains the computations. The function in its entirety can be found in Appendix A – Code of the carbon model (pg. 69).

! determine temperature range
trange=0.5*(maxt-mint)
! daily canopy conductance, of CO2 or H2O?
gc=abs(deltaWP)**(hydraulic_exponent)/((hydraulic_temp_coef*Rtot+trange))
! maximum rate of temperature and nitrogen (canopy efficiency) limited
! photosynthesis (gC.m-2.day-1)
pn=lai*nit*NUE*exp(temp_exponent*maxt)
! pp and qq represent limitation by diffusion and metabolites respectively
pp=pn/gc ; qq=co2_comp_point-co2_half_sat
! calculate internal CO2 concentration (ppm)
ci=0.5*(co2+qq-pp+sqrt((co2+qq-pp)**2-4.0*(co2*qq-pp*co2_comp_point)))
! limit maximum quantum efficiency by leaf area, hyperbola
e0=lai_coef*(lai*lai)/((lai*lai)+lai_const)
! calculate day length (hours)
! This is the old REFLEX project calculation but it is wrong so anyway here
! we go...
dec=-23.4*cos((360*(doy+10.)/365)*deg_to_rad)*deg_to_rad
mult=tan(lat*deg_to_rad)*tan(dec)
if (mult>=1) then
   dayl=24.0
else if (mult<=-1) then
   dayl=0.
else
   dayl=24.*acos(-mult)/pi
end if

! ------------------------------------------------------------
! calculate CO2 limited rate of photosynthesis
pd=gc*(co2-ci)
! calculate combined light and CO2 limited photosynthesis
cps=e0*radiation*pd/(e0*radiation+pd)
! correct for day length variation
acm=cps*(dayl_coef*dayl+dayl_const)

Figure 5-13 The main part of the ACM function

The calculation that assigns trange depends on maxt and mint, which in turn depend on drivers(2) and drivers(3). The drivers array is an input argument of ACM; when calling the ACM function, the Carbon_Model procedure passes the gpppars array as the drivers argument. The gpppars(2) and gpppars(3) values are filled in from the met array, a two dimensional array containing constraints and details about the weather for each time step. This array is filled once at start-up of CARDAMOM from the supplied input parameter file. Therefore, the trange values can be computed once per time step and reused for all the solutions requested (i.e. 120 time steps rather than 120 time steps × 1,000,000 parameter samples).
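The saving can be sketched as follows. This is an illustrative Python fragment with made-up driver values (the real code is Fortran): quantities that depend only on the time step are tabulated once, so the per-sample loop merely indexes the table instead of recomputing the expression.

```python
# Hypothetical stand-ins for met(2,:) and met(3,:) (min/max temperature).
n_steps = 120
met_mint = [10.0 + 0.05 * n for n in range(n_steps)]
met_maxt = [20.0 + 0.05 * n for n in range(n_steps)]

# Tabulate trange once: n_steps evaluations in total ...
trange_table = [0.5 * (mx - mn) for mx, mn in zip(met_maxt, met_mint)]

def acm_like(n):
    # ... so instead of recomputing 0.5*(maxt-mint) for every one of the
    # ~1,000,000 parameter samples, each sample just indexes the table.
    return trange_table[n]

assert acm_like(0) == 5.0   # 0.5 * (20.0 - 10.0)
assert len(trange_table) == n_steps
```

The cost drops from (samples × time steps) evaluations to one evaluation per time step, at the price of a small lookup table.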

The computation of gc depends on deltaWP, hydraulic_exponent, hydraulic_temp_coef, Rtot and trange. As discussed above, the trange value can be pre-computed for each day, with the advantage that it can then be reused for all the solutions requested. The Rtot and deltaWP variables come from drivers(10) and drivers(9) respectively, which in turn come from gpppars(10) and gpppars(9), which are assigned constant values in the Carbon_Model procedure. The hydraulic_temp_coef and hydraulic_exponent values are constants. Therefore, the gc variable can be precomputed in its entirety once for all the samples requested.

The pn computation depends on lai, nit, NUE, temp_exponent and maxt. The lai variable depends on gpppars(1) and on POOLS(n,2), which changes within the Carbon_Model loop at each time step and therefore cannot be pre-computed. However, nit, NUE and temp_exponent are constants. Furthermore, maxt, as already discussed above, is constant across samples but not across time steps. Therefore, part of the pn computation can be precomputed.

The pp computation depends on pn and therefore cannot be precomputed. The qq computation is trivial and can be precomputed, because both co2_comp_point and co2_half_sat are constants.

The ci computation is highly dependent on the co2 variable, which comes from the met array. However, it is also dependent on pp and therefore cannot be precomputed.

The e0 computation is highly dependent on lai and therefore cannot be precomputed.

The computation of dec is highly dependent on doy, which comes from a calculation between a value in the met array and a value from the deltat array. Like the met array, the deltat array contains values for the different time steps but is constant between the calculations of the samples requested.

The mult computation depends on lat (the site latitude), which is read once from the input file, on dec, which is explained above, and on deg_to_rad, which is a constant.

The computation of pd depends on gc and on co2, which comes from the met array, but also on ci, which, as described above, cannot be precomputed; pd therefore remains computed within ACM.

The cps value is calculated from the e0, pd and radiation variables. As described above, e0 and pd are computed within ACM itself. The radiation variable is retrieved from the met array and therefore has the characteristics already described. Because of the dependency on e0, cps cannot be precomputed.

The final calculation of acm depends on cps, dayl_coef, dayl and dayl_const. The source of the cps value is described above; dayl_coef and dayl_const are constants, while dayl is calculated within ACM based on mult.

5.2.2.3 Pre-computations

In order to reduce the number of operations performed in the ACM function, some values that could be evaluated less frequently during the whole execution of CARDAMOM were precomputed. This was done in two different sets of tests: one dealing with precomputations of variables that are constant during the run of CARDAMOM (i.e. remain the same for each time step and solution requested), and another dealing with computations that are more variable, mainly those which vary from one time step to another but remain the same between the different solutions requested.

In the first set of tests, variables which are constant throughout the run of CARDAMOM (i.e. remain the same for each time step and solution requested) were precomputed. The values changed were:

- The abs(deltaWP)**(hydraulic_exponent) and hydraulic_temp_coef*Rtot parts of the gc assignment.
- The nit*NUE product used in the pn assignment.
- The lai*lai product, which was also computed in the first part of the ACM function.

This was performed by first declaring three saved variables: nitnuemultipliedbynai, abs_deltawp_exp_hydraulicexponent and hydraulic_temp_coef_multiplied_rtot (shown in Figure 5-14). These were assigned once inside a pre-existing branch that ensures the values are computed only once during the whole execution of CARDAMOM (the branch checking whether tmp_x is allocated, shown in Figure 5-15).

!nit and nue are constants
double precision, save :: nitnuemultipliedbynai
!deltawp and hydraulic exponent are constants
double precision, save :: abs_deltawp_exp_hydraulicexponent
!hydraulic_temp_coef and Rtot are constants
double precision, save :: hydraulic_temp_coef_multiplied_rtot

Figure 5-14 The declaration of the three saved variables nitnuemultipliedbynai, abs_deltawp_exp_hydraulicexponent and hydraulic_temp_coef_multiplied_rtot used to precompute values for the ACM function

! calculate some values once as these are invariant between DALEC runs
if (.not.allocated(tmp_x)) then
   !nit and nue are constants
   nitnuemultipliedbynai = (gpppars(4) * constants(1))
   !deltawp and hydraulic exponent are constants
   abs_deltawp_exp_hydraulicexponent = (abs(gpppars(9))**(constants(10)))
   !hydraulic_temp_coef and Rtot are constants
   hydraulic_temp_coef_multiplied_rtot = (constants(6) * gpppars(10))

Figure 5-15 The assignment of nitnuemultipliedbynai, abs_deltawp_exp_hydraulicexponent and hydraulic_temp_coef_multiplied_rtot

Furthermore, within the ACM function, lai*lai was computed earlier and assigned to a new variable named laiSquared (see Figure 5-16).

laiSquared = lai*lai

Figure 5-16 Precomputing lai*lai in order to reduce computations and increase pipeline efficiency

The changes to use precomputations within ACM are shown and highlighted in Figure 5-17.

! daily canopy conductance, of CO2 or H2O?
gc=abs_deltawp_exp_hydraulicexponent/((hydraulic_temp_coef_multiplied_rtot+trange))
! maximum rate of temperature and nitrogen (canopy efficiency) limited
! photosynthesis (gC.m-2.day-1)
pn=lai*nitnuemultipliedbynai*exp(temp_exponent*maxt)
! pp and qq represent limitation by diffusion and metabolites respectively
pp=pn/gc ; qq=co2_comp_point-co2_half_sat
! calculate internal CO2 concentration (ppm)
ci=0.5*(co2+qq-pp+sqrt((co2+qq-pp)**2-4.0*(co2*qq-pp*co2_comp_point)))
! limit maximum quantum efficiency by leaf area, hyperbola
e0=lai_coef*(laiSquared)/((laiSquared)+lai_const)

Figure 5-17 The parts of ACM where portions of the variable assignments were replaced with precomputed values

Furthermore, another test was created to compute the values once per MHMCMC step taken (instead of once for the whole run of CARDAMOM). This was done by moving the assignments shown in Figure 5-15 outside the branch and removing the save attribute from the variable declarations.

The results shown in Figure 5-18 indicate that when these precomputations are performed once for each sample, there is a slight performance increase (based on the single seed benchmark). However, pre-computing the values only once for the whole execution of the application resulted in a substantial performance degradation; this is assumed to be caused by a less optimal memory access pattern.

Code run         Original code  Pre-computing for each sample  Pre-computing once
Single seed (s)  402.95         401.20                         441.58
Random seed (s)  380.92         382.68                         421.63

Figure 5-18 Time taken to run CARDAMOM with (a) the original code, (b) some ACM computations precomputed once per MHMCMC sample, and (c) some ACM computations precomputed once for the whole execution. Results show that precomputing only once decreases performance significantly, while precomputing at each MHMCMC step increases performance slightly.

In the second set of tests, the parts of the ACM function (see Figure 5-13) that depend only on variables computed once per time step (refer to section 5.2.2.2) were precomputed once for each time step instead of being recomputed for every step of the MHMCMC chain (i.e. 1 × 120 time steps instead of 1,000,000 × 120 time steps).


The variables used within the ACM function that have been precomputed or partially precomputed are as follows:

- trange: fully precomputed
- gc: fully precomputed
- pn: the exponential part precomputed
- mult: the tangent part precomputed

In order to apply these precomputations, some code changes had to be made. First, a two dimensional array was declared (see Figure 5-19) to hold the precomputed values. The first dimension represents the precomputed elements, i.e. trange, gc, part of pn and part of mult, while the second dimension represents the time step. Since the number of time steps differs depending on the parameter input file given to CARDAMOM, the array had to be dynamically sized; furthermore, it had to persist across the steps taken within the MHMCMC. Hence, the array was declared allocatable.

In order to compute the Trange and the GC values, two new functions were declared as shown in Figure 5-20 and Figure 5-21 respectively. These functions use the same expression used within the ACM function to calculate GC and Trange.

In order to precompute the values, a branch was added in the Carbon_Model sub procedure. This branch was used to check if the array acmOptimisationArray had to be allocated in order to allocate and precompute values when needed, as shown in Figure 5-22.

double precision, allocatable, dimension(:,:) :: acmOptimisationArray

Figure 5-19 Declaration of the two dimensional array "acmOptimisationArray" used to store precomputed values for the ACM function; the first dimension represents one of the four precomputed elements ('trange', 'gc', part of 'pn' and part of 'mult'), the second dimension represents the day.

double precision pure function calculcateTrange(maxt, mint)
   double precision, intent(in) :: maxt, mint
   calculcateTrange = 0.5 * (maxt-mint)
   return
end function calculcateTrange

Figure 5-20 Declaration of the "calculcateTrange" function, which computes the trange value as previously done within the ACM function

double precision pure function generateGC(deltaWP, hydraulic_exponent, &
                                          hydraulic_temp_coef, Rtot, trange)
   double precision, intent(in) :: deltaWP, hydraulic_exponent, &
                                   hydraulic_temp_coef, Rtot, trange
   generateGC=0.0
   generateGC=abs(deltaWP)**(hydraulic_exponent)/((hydraulic_temp_coef*Rtot+trange))
   return
end function generateGC

Figure 5-21 Declaration of the "generateGC" function which computes the GC as previously done within the ACM function.

if (.not.allocated(acmOptimisationArray)) then
   allocate(acmOptimisationArray(4, nodays))
   do n=1, nodays
      !Pre-compute Trange
      acmOptimisationArray(1,n)=calculcateTrange(met(3,n), met(2,n))
      !Pre-compute GC (gpppars(9) and gpppars(10) are constants)
      acmOptimisationArray(2,n) = generateGC(gpppars(9), constants(10), &
                                             constants(6), gpppars(10), &
                                             acmOptimisationArray(1,n))
      !pre-compute part of PN
      acmOptimisationArray(3,n) = exp(constants(8) * met(3,n))
      !pre-compute part of mult
      acmOptimisationArray(4,n)=tan(lat*deg_to_rad)
   end do
end if

Figure 5-22 Precomputation of trange and gc, and of parts of the pn and mult expressions used within the ACM function. The precomputations are performed once; however, they are calculated for each day that CARDAMOM simulates, as the met array (used in the calculation of trange and pn) can differ between days but is statically defined in the input file.

The precomputed values were then used within ACM, as shown in Figure 5-23. Apart from the values precomputed outside the ACM function, a new expression computing lai*lai was placed at the beginning of the function to fill the pipeline with more computations before the division operations and the square root function, which are known to have higher overhead and latency [17].

! determine temperature range
trange=acmOptimisationArray(1)
! daily canopy conductance, of CO2 or H2O?
gc=acmOptimisationArray(2)

… (declaration of variables etc)

laiSquared=lai*lai
! maximum rate of temperature and nitrogen (canopy efficiency) limited
! photosynthesis (gC.m-2.day-1)
pn=lai*nit*NUE*acmOptimisationArray(3)
! pp and qq represent limitation by diffusion and metabolites respectively
pp=pn/gc ; qq=co2_comp_point-co2_half_sat
! calculate internal CO2 concentration (ppm)
ci=0.5*(co2+qq-pp+sqrt((co2+qq-pp)**2-4.0*(co2*qq-pp*co2_comp_point)))
! limit maximum quantum efficiency by leaf area, hyperbola
e0=lai_coef*(laiSquared)/((laiSquared)+lai_const)
! calculate day length (hours)
! This is the old REFLEX project calculation but it is wrong so anyway here
! we go...
dec=-23.4*cos((360*(doy+10.)/365)*deg_to_rad)*deg_to_rad


mult=acmOptimisationArray(4)*tan(dec)

Figure 5-23 The parts of the ACM function changed to use precomputed values (highlighted): trange and gc are assigned precomputed values, part of pn has been precomputed, and lai*lai is computed earlier within the ACM function and reused in the e0 assignment.

The change was then tested on Eddie. The results (shown in Figure 5-24) show that this change contributed to a positive performance gain of ~2.37% based on the single seed benchmark while a higher ~7.28% was observed when using the random seed benchmark.

Code run         Original code  ACM with pre-computations
Single seed (s)  402.95         393.39
Random seed (s)  380.92         366.68

Figure 5-24 Average time taken to run CARDAMOM using (a) the original source code and (b) source code modified to precompute gc, trange and parts of pn and mult used within the ACM function. The results show that these precomputations improved performance on both benchmarks.

5.2.3 Extracting function from the main Carbon_Model loop

In order to try to reduce the amount of code in the instruction cache, parts of the loop that are executed infrequently (i.e. that sit behind a branch within the loop), specifically the section dealing with litter and the section dealing with fire, were extracted into two separate subroutines. Three different tests were then performed:

1. The procedures were not given any directives.
2. The procedure calls were annotated with directives5 (for the Intel Fortran Compiler) so that they would not be inlined.
3. The procedure calls were annotated with directives5 not to be inlined, and the most frequently executed ACM call was annotated to be force-inlined.

5 https://software.intel.com/en-us/node/580762#622C864A-A269-4A5F-8ECF-245EA8A29C6C

The results shown in Figure 5-25 indicate that extracting the functions had little effect on the overall performance of the application, and that enforcing no inlining slightly degraded performance. Forcing the inlining of ACM together with no inlining of the two new sub procedures gave conflicting results between the single seed and random seed benchmarks.

Code run                                                       Single seed (s)  Random seed (s)
Original code                                                  402.95           380.92
Extracted functions, forced no inline                          405.75           385.25
Extracted functions, no inline on new functions, ACM inlined   403.84           379.78
Extracted functions, without specifying directives             401.48           384.26

Figure 5-25 Results of running the code with (a) no changes, (b) sub procedures extracted from the main loop with the compiler directed not to inline them, (c) sub procedures extracted with no inlining enforced on them and inlining forced on ACM, and (d) sub procedures extracted without any directives (either inlining or no inlining)

5.2.4 Calculate growing session index code block

The calculate growing session index code block (shown in Figure 5-26) is a minor hotspot within CARDAMOM, consuming around ~3.2% of the total running time. It was identified that the reciprocal multiplication already described in section 5.2.1.1 could be used to optimise the division operations performed within this code block.

! temperature limitation, then restrict to 0-1; correction for k-> oC
Tfac = (met(10,n)-(pars(14)-273.15)) / (pars(15)-pars(14))
Tfac = min(1d0,max(0d0,Tfac))
! photoperiod limitation
Photofac = (met(11,n)-pars(16)) / (pars(24)-pars(16))
Photofac = min(1d0,max(0d0,Photofac))
! VPD limitation
VPDfac = 1.0 - ( (met(12,n)-pars(25)) / (pars(26)-pars(25)) )
VPDfac = min(1d0,max(0d0,VPDfac))

! calculate and store the GSI index
FLUXES(n,18) = Tfac*Photofac*VPDfac

! Possible modifications to avoid new -> old when on upward trend:
! 3) GSI gradient
! These may require new parameter to consider the original condition?

! we will load up some needed variables
m = tmp_m(n)
! update gsi_history for the calculation
! gsi_history((gsi_lag-m):gsi_lag) = FLUXES((n-m):n,18)

if (n == 1) then
   ! in first step only we want to take the initial GSI value only
   gsi_history(gsi_lag) = FLUXES(n,18)
else
   gsi_history((gsi_lag-m):gsi_lag) = FLUXES((n-m):n,18)
end if

Figure 5-26 The calculate growing session index code block

All the pars values change only once per step taken in the MHMCMC (i.e. not at each time step within the Carbon_Model loop). Therefore, the reciprocal multipliers can be computed once for the entirety of the time steps. Three variables were declared for the divisors of Tfac, Photofac and VPDfac, and the reciprocal multipliers were computed before the start of the main loop within the Carbon_Model. Each division operation was then replaced with a multiplication, as shown in Figure 5-27.

! temperature limitation, then restrict to 0-1; correction for k-> oC
Tfac = (met(10,n)-(pars(14)-273.15)) * tfacDivisor
Tfac = min(1d0,max(0d0,Tfac))
! photoperiod limitation
Photofac = (met(11,n)-pars(16)) * photofacDivisor
Photofac = min(1d0,max(0d0,Photofac))
! VPD limitation
VPDfac = 1.0 - ( (met(12,n)-pars(25)) * VPDfacDivisor )
VPDfac = min(1d0,max(0d0,VPDfac))

Figure 5-27 Changes performed to the calculate growing session index code block
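The transformation can be checked in isolation. The following Python sketch uses hypothetical values for pars(14) and pars(15) (the variable names mirror those in the figure, but the numbers are made up): the single division is hoisted out of the loop, and each time step then performs one multiply in its place.

```python
# Hypothetical values for pars(14) and pars(15) (temperature bounds in K).
pars14, pars15 = 268.15, 303.15

# Computed once before the time-step loop: one division in total.
tfacDivisor = 1.0 / (pars15 - pars14)

def tfac(met10):
    # One multiply per time step replaces one divide per time step;
    # the result is then clamped to the 0-1 range as in the original.
    t = (met10 - (pars14 - 273.15)) * tfacDivisor
    return min(1.0, max(0.0, t))

# (met10 + 5) / 35 = 0.5 when met10 = 12.5, up to rounding error
assert abs(tfac(12.5) - 0.5) < 1e-12
assert tfac(400.0) == 1.0    # clamped at the upper bound
assert tfac(-100.0) == 0.0   # clamped at the lower bound
```

Reciprocal multiplication introduces a rounding difference of at most one unit in the last place per operation, which is why the output verification described in section 3.5 matters for this change.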

The code was then compiled and executed on EDDIE and the results are shown in Figure 5-28. The result shows a minor performance improvement both when using the single seed benchmark as well as when using the random seed benchmark.


Code run         Reciprocal multiplication in GSI  Original code
Single seed (s)  402.00                            402.95
Random seed (s)  380.00                            380.92

Figure 5-28 Average time taken to run CARDAMOM with (a) division operation replaced with reciprocal multiplication within the calculate growing session index code block, and (b) the original code

5.3 THE EDC2_GSI SUBROUTINE OPTIMISATION

The EDC2_GSI subroutine is a minor hotspot within CARDAMOM, consuming ~9.1% of the total computation time. It is mainly composed of a series of consecutive branches that check the various parameters against ecological constraints.

Branches limit the amount of out of order execution the CPU can perform; CPUs therefore use branch prediction to continue processing speculatively, discarding the speculatively executed instructions whenever a branch is mispredicted. By reducing the number of branches, or by helping the branch predictor to predict them correctly, one can usually reduce the number of pipeline flushes the CPU needs to perform.

Most of the branches within EDC2_GSI check in a subsequent fashion whether either the EDC2 or DIAG variables are set to true. The EDC2 variable is set in the beginning of the EDC2_GSI subroutine to true, and it is set to false once any parameter fails to meet the ecological constraint criteria. The DIAG flag is set from the input file given to CARDAMOM in the beginning of the execution.

if ((EDC2 == 1 .or. DIAG == 1) .and. (some ecological constraint failed)) then
   EDC2=0 ; EDCD%PASSFAIL(ecological constraint index)=0
endif

Figure 5-29 Example of the branches used within the EDC2_GSI subroutine

48

It was determined that the code could be changed so that the DIAG flag is checked only once in the beginning of the EDC2_GSI subroutine. When the DIAG flag is true, the part of the branches which check whether EDC2 or DIAG are true can be completely removed, since DIAG would be known to have a value of true. On the other hand, when the DIAG value is set to false, the branches checking whether EDC2 or DIAG are true can be changed to check for EDC2 only since DIAG would be known to have a value of false.

if (DIAG == 1) then
   ! DIAG is already known to be true so no need to check DIAG or EDC2
   if (some ecological constraint failed) then
      EDC2=0 ; EDCD%PASSFAIL(5)=0
   endif

   … more ecological constraint checks

else
   ! DIAG is already known to be false so no need to check DIAG
   if ((EDC2 == 1) .and. (some ecological constraint failed)) then
      EDC2=0 ; EDCD%PASSFAIL(20)=0
   endif

   … more ecological constraint checks

end if

Figure 5-30 Pseudo code illustrating how the change was applied
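The effect of the restructuring can be modelled in Python (an illustrative sketch only, not the thesis Fortran; `failures` is a hypothetical list of (constraint index, failed?) pairs standing in for the constraint checks):

```python
def edc2_like(diag, failures):
    """Model of EDC2_GSI after the change: the invariant DIAG test is
    hoisted, so each constraint check evaluates one condition, not two."""
    edc2 = 1
    passfail = {}
    if diag:
        # DIAG known true: no need to test DIAG or EDC2 per constraint.
        for idx, failed in failures:
            if failed:
                edc2 = 0
                passfail[idx] = 0
    else:
        # DIAG known false: only EDC2 needs testing per constraint.
        for idx, failed in failures:
            if edc2 == 1 and failed:
                edc2 = 0
                passfail[idx] = 0
    return edc2, passfail

# With DIAG set, every failing constraint is recorded; without it,
# recording stops after the first failure, matching the original logic.
assert edc2_like(True,  [(5, True), (20, True)]) == (0, {5: 0, 20: 0})
assert edc2_like(False, [(5, True), (20, True)]) == (0, {5: 0})
```

The observable behaviour is unchanged; only the number of conditions evaluated per constraint check is reduced.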

Two tests were performed: one placing the changes in the same subroutine, and one placing the restructured code in a new subroutine. The results shown in Figure 5-31 show an insignificant run-time improvement when the changes were kept in the same procedure, and a significant decrease in performance when the change was extracted into a new function.


Code run         GSI change in same function  Original code  GSI change extracted into function
Single seed (s)  402.95                       402.95         405.54
Random seed (s)  378.35                       380.92         389.69

Figure 5-31 Average time taken to run CARDAMOM with (a) GSI if statement simplified, with new branch in the same function (b) original code (c) GSI branch simplified and extracted into a new function

5.4 MERGING THE OPTIMISATIONS TOGETHER

The changes described in the previous sections were performed independently of each other. This section discusses how these changes were merged and which optimisations were chosen.

The considerations taken when applying each change were mainly four:

- The effect on performance of applying the change.
- The effect on parts of the code base that are reused elsewhere (i.e. a performance gain in CARDAMOM could lead to performance or maintainability problems in other projects).
- The effect on the readability and maintainability of the code block being changed.
- Whether the optimisations can be implemented together; some of the tests described explored alternative ways of resolving the same problem and are therefore mutually exclusive.

The approach taken to merge the optimisations was to identify the changes that performed best in both the random seed benchmarks and the single seed benchmarks, and after each change a test was run. The changes chosen to merge together were (in the order they were merged):

1. Precompute values used in the ACM function, as described in section 5.2.2.3 (second set of tests).
2. Replace the division operations by deltat(n), used especially within the Calculate time and temperature dependencies code block, with a reciprocal multiplication (see section 5.2.1.1).
3. Replicate the FLUXES array writes into a temporary array in order to rearrange the data for the subsequent reads (see section 5.2.1.2).
4. Replace the division operations within the calculate growing session index code block with a reciprocal multiplication (see section 5.2.4).
5. The GSI_EDC2 optimisation which restructures a branch within the same function was also tested; however, due to its low benefit and its cost to code maintainability, this change is not recommended.

Figure 5-32 shows the results of the different sets of changes. The sets are based on the list above (i.e. merge up to 3 means list items 1, 2 and 3 combined). From the results, it can be seen that the best combination is to use all the changes except the GSI_EDC2 change (change 5 in the list above). The accumulated results show a performance increase of ~5-6%.

Furthermore, comparing merge up to change 2 with merge up to change 3, a minor degradation can be seen. However, when change 4 is applied, the performance gain exceeds the ~0.9 second difference observed in the reciprocal multiplication tests described in section 5.2.4. The unexpected time difference between merge 3 and merge 4 is thought to be caused by side effects of the interacting changes.
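The locality idea behind change 3 can be sketched as follows (a C illustration under assumed array shapes, not CARDAMOM's actual code): with column-major storage, the fluxes belonging to a single time step are strided across memory, so mirroring them into a small contiguous buffer as they are produced makes the subsequent reads sequential.

```c
#define NSTEPS 4
#define NFLUXES 18

/* Illustrative sketch (hypothetical names and sizes). Flux k of step n is
 * stored at fluxes[k][n], so one step's values sit NSTEPS doubles apart. */
double sum_step_strided(double fluxes[NFLUXES][NSTEPS], int n)
{
    double s = 0.0;
    for (int k = 0; k < NFLUXES; k++)
        s += fluxes[k][n];           /* strided reads across cache lines */
    return s;
}

double sum_step_buffered(double fluxes[NFLUXES][NSTEPS], int n)
{
    double buf[NFLUXES];             /* contiguous mirror of step n; in the
                                        real change this is filled as the
                                        fluxes are written, not copied after */
    for (int k = 0; k < NFLUXES; k++)
        buf[k] = fluxes[k][n];
    double s = 0.0;
    for (int k = 0; k < NFLUXES; k++)
        s += buf[k];                 /* sequential reads */
    return s;
}
```

Both routines produce identical results; only the memory access pattern of the reads differs, which is why the benefit grows with the number of time steps.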


Average time taken (s) per merge set:

                Merge up to 2   Merge up to 3   Merge up to 4   Merge up to 5   Original Code
Single Seed     385.94          387.95          382.09          382.81          402.95
Random Seed     363.75          373.14          360.22          363.60          380.92

Figure 5-32 Average time to run CARDAMOM using different accumulated changes compared with the Original Code

6 CONCLUSIONS

The objective of this project was to increase the efficiency of running CARDAMOM, especially on the Eddie3 platform. Although only a ~5-6% efficiency gain was achieved from manual code optimisation, considering the thousands of CARDAMOM instances run, the benefit of such optimisations is significant. Furthermore, during this project, further efficiency was achieved by identifying an appropriate compilation strategy. This gave at least a ~15% performance increase compared with the compilation strategy previously used to compile CARDAMOM (-O2).

Once the compilation strategy was identified, profiling and hotspot analysis were undertaken. These showed that most of the execution time in CARDAMOM is spent in a single loop, and more specifically, in two main code blocks within that loop.

The two code blocks identified as the major hotspots had issues relating to high utilisation of the divider, problematic memory access patterns and high utilisation of the load and store unit. Different techniques to optimise the procedures were applied. The most successful changes involved precomputation and the reduction of divisions and other expensive operations. Another change was to increase the locality of the reads within the POOLS array. This gave limited positive results; however, it is expected to give larger gains when the number of time steps used in the simulation is larger, because the difference in locality between the arrays would be greater.

Further to the objective, the project also generated some useful information about the behaviour of CARDAMOM, such as the observed time difference range between using different seeds.

7 FUTURE WORK

This project aimed to make running CARDAMOM more efficient through code optimisation. However, it was observed that there were few opportunities for further serial code optimisation. Therefore, investigating other ways to achieve efficiency and more accurate results could be more beneficial.

Such investigations could target reducing any unnecessary computation performed across different concurrent executions of CARDAMOM, and introducing methods to detect, or predict, any occurrence of chain blocking.

APPENDIX A – CODE OF THE CARBON MODEL SUB PROCEDURE

module CARBON_MODEL_MOD

implicit none

! make all private
private

! explicit publics
public :: CARBON_MODEL  &
         ,itemp,ivpd,iphoto &
         ,extracted_C   &
         ,dim_1,dim_2   &
         ,nos_trees     &
         ,nos_inputs    &
         ,leftDaughter  &
         ,rightDaughter &
         ,nodestatus    &
         ,xbestsplit    &
         ,nodepred      &
         ,bestvar

! ACM related parameters
double precision, parameter :: pi = 3.1415927
double precision, parameter :: deg_to_rad = pi/180.0

! forest rotation specific info
double precision, allocatable, dimension(:) :: extracted_C,itemp,ivpd,iphoto

! arrays for the emulator, just so we load them once and that is it cos they be
! massive
integer ::   dim_1, & ! dimension 1 of response surface
             dim_2, & ! dimension 2 of response surface
         nos_trees, & ! number of trees in randomForest
        nos_inputs    ! number of driver inputs

integer :: gsi_lag
double precision, allocatable, dimension(:) :: tmp_x, tmp_m
double precision, allocatable, dimension(:,:) :: leftDaughter,  & ! left daughter for forest
                                                 rightDaughter, & ! right daughter for forest
                                                 nodestatus,    & ! nodestatus for forests
                                                 xbestsplit,    & ! for forest
                                                 nodepred,      & ! prediction value for each tree
                                                 bestvar          ! for randomForests

contains
!
!--------------------------------------------------------------------
!
subroutine CARBON_MODEL(met,pars,deltat,nodays,lat,lai,NEE,FLUXES,POOLS &


                       ,nopars,nomet,nopools,nofluxes,GPP)

! The Data Assimilation Linked Ecosystem Carbon - Growing Season
! Index - Forest Rotation (DALEC_GSI_FR) model.
! The subroutine calls the Aggregated Canopy Model to simulate GPP and
! partitions between various ecosystem carbon pools. These pools are
! subject to turnovers / decomposition resulting in ecosystem phenology
! and fluxes of CO2

implicit none

! declare input variables
integer, intent(in) :: nopars   & ! number of parameters in vector
                      ,nomet    & ! number of meteorological fields
                      ,nofluxes & ! number of model fluxes
                      ,nopools  & ! number of model pools
                      ,nodays     ! number of days in simulation

double precision, intent(in) :: met(nomet,nodays) & ! met drivers
                               ,deltat(nodays)    & ! time step in decimal days
                               ,pars(nopars)      & ! number of parameters
                               ,lat                 ! site latitude (degrees)

double precision, dimension(nodays), intent(inout) :: lai & ! leaf area index
                                                     ,GPP & ! Gross primary productivity
                                                     ,NEE   ! net ecosystem exchange of CO2

double precision, dimension((nodays+1),nopools), intent(inout) :: POOLS ! vector of ecosystem pools

double precision, dimension(nodays,nofluxes), intent(inout) :: FLUXES ! vector of ecosystem fluxes

! declare general local variables
double precision :: gpppars(12)   & ! ACM inputs (LAI+met)
                   ,constants(10)   ! parameters for ACM

integer :: p,f,nxp,n,test,m

! local fire related variables
double precision :: CFF(6) = 0, CFF_res(4) = 0   & ! combusted and non-combustion fluxes
                   ,NCFF(6) = 0, NCFF_res(4) = 0 & ! with residue and non-residue separates
                   ,combust_eff(5)               & ! combustion efficiency
                   ,rfac                           ! resilience factor

! local deforestation related variables
double precision, dimension(4) :: post_harvest_burn & ! how much burning to occur after
                                 ,foliage_frac_res  &
                                 ,roots_frac_res    &
                                 ,rootcr_frac_res   &
                                 ,stem_frac_res     &


                                 ,branch_frac_res   &
                                 ,Cbranch_part      &
                                 ,Crootcr_part      &
                                 ,soil_loss_frac
double precision :: labile_loss,foliar_loss       &
                   ,roots_loss,wood_loss          &
                   ,labile_residue,foliar_residue &
                   ,roots_residue,wood_residue    &
                   ,wood_pellets,C_total          &
                   ,labile_frac_res               &
                   ,Cstem,Cbranch,Crootcr         &
                   ,stem_residue,branch_residue   &
                   ,coarse_root_residue           &
                   ,soil_loss_with_roots
integer :: reforest_day, harvest_management, restocking_lag

! local variables for GSI phenology model
double precision :: Tfac,Photofac,VPDfac         & ! oC, seconds, Pa
                   ,delta_gsi,tmp,gradient       &
                   ,fol_turn_crit,lab_turn_crit  &
                   ,gsi_history(22),just_grown

! met drivers are:
! 1st  run day
! 2nd  min daily temp (oC)
! 3rd  max daily temp (oC)
! 4th  Radiation (MJ.m-2.day-1)
! 5th  CO2 (ppm)
! 6th  DOY
! 7th  lagged precip
! 8th  deforestation fraction
! 9th  burnt area fraction
! 10th 21 day average min temperature (K)
! 11th 21 day average photoperiod (sec)
! 12th 21 day average VPD (Pa)
! 13th Forest management practice to accompany any clearing

! POOLS are:
! 1 = labile (p18)
! 2 = foliar (p19)
! 3 = root   (p20)
! 4 = wood   (p21)
! 5 = litter (p22)
! 6 = som    (p23)

! p(30) = labile replanting
! p(31) = foliar replanting
! p(32) = fine root replanting
! p(33) = wood replanting

! FLUXES are:
! 1 = GPP
! 2 = temprate
! 3 = respiration_auto


! 4 = leaf production
! 5 = labile production
! 6 = root production
! 7 = wood production
! 8 = labile production
! 9 = leaffall factor
! 10 = leaf litter production
! 11 = woodlitter production
! 12 = rootlitter production
! 13 = respiration het litter
! 14 = respiration het som
! 15 = litter2som
! 16 = labrelease factor
! 17 = carbon flux due to fire
! 18 = growing season index

! PARAMETERS
! 17+4(GSI) values

! p(1)  Litter to SOM conversion rate - m_r
! p(2)  Fraction of GPP respired - f_a
! p(3)  Fraction of NPP allocated to foliage - f_f
! p(4)  Fraction of NPP allocated to roots - f_r
! p(5)  max leaf turnover (GSI) ! Leaf lifespan - L_f (CDEA)
! p(6)  Turnover rate of wood - t_w
! p(7)  Turnover rate of roots - t_r
! p(8)  Litter turnover rate - t_l
! p(9)  SOM turnover rate - t_S
! p(10) Parameter in exponential term of temperature - \theta
! p(11) Canopy efficiency parameter - C_eff (part of ACM)
! p(12) = max labile turnover (GSI) ! date of Clab release - B_day (CDEA)
! p(13) = Fraction allocated to Clab - f_l
! p(14) = min temp threshold (GSI) ! lab release duration period - R_l (CDEA)
! p(15) = max temp threshold (GSI) ! date of leaf fall - F_day
! p(16) = min photoperiod threshold (GSI)
! p(17) = LMA
! p(24) = max photoperiod threshold (GSI)
! p(25) = min VPD threshold (GSI)
! p(26) = max VPD threshold (GSI)
! p(27) = minimum GPP benefit of increased LAI for labile allocation to be allowed
! p(28) = fraction of Cwood which is Cbranch
! p(29) = fraction of Cwood which is Ccoarseroot

! variables related to deforestation
! labile_loss = total loss from labile pool from deforestation
! foliar_loss = total loss from foliar pool from deforestation
! roots_loss = total loss from root pool from deforestation
! wood_loss = total loss from wood pool from deforestation
! labile_residue = harvested labile remaining in system as residue
! foliar_residue = harvested foliar remaining in system as residue
! roots_residue = harvested roots remaining in system as residue
! wood_residue = harvested wood remaining in system as residue


! coarse_root_residue = expected coarse woody root left in system as residue

! parameters related to deforestation
! labile_frac_res = fraction of labile harvest left as residue
! foliage_frac_res = fraction of foliage harvest left as residue
! roots_frac_res = fraction of roots harvest left as residue
! wood_frac_res = fraction of wood harvest left as residue
! Crootcr_part = fraction of wood pool expected to be coarse root
! Crootcr_frac_res = fraction of coarse root left as residue
! soil_loss_frac = fraction determining Csom expected to be physically
! removed along with coarse roots

! load some values
gpppars(4) = 1.0 ! 10d0**(pars(11)) !TLS 1 ! foliar N
gpppars(7) = lat
gpppars(9) = -2.0 ! leafWP-soilWP
gpppars(10) = 1.0 ! total hydraulic resistance
gpppars(11) = pi

! assign acm parameters
constants(1)=pars(11)
constants(2)=0.0156935
constants(3)=4.22273
constants(4)=208.868
constants(5)=0.0453194
constants(6)=0.37836
constants(7)=7.19298
constants(8)=0.011136
constants(9)=2.1001
constants(10)=0.789798

! assigning initial conditions
POOLS(1,1)=pars(18)
POOLS(1,2)=pars(19)
POOLS(1,3)=pars(20)
POOLS(1,4)=pars(21)
POOLS(1,5)=pars(22)
POOLS(1,6)=pars(23)

! initial values for deforestation variables
labile_loss = 0.    ; foliar_loss = 0.
roots_loss = 0.     ; wood_loss = 0.
labile_residue = 0. ; foliar_residue = 0.
roots_residue = 0.  ; wood_residue = 0.
stem_residue = 0.   ; branch_residue = 0.
reforest_day = 0
soil_loss_with_roots = 0.
coarse_root_residue = 0.
post_harvest_burn = 0.

! now load the hardcoded forest management parameters into their locations

! Parameter values for deforestation variables


! scenario 1
! harvest residue (fraction); 1 = all remains, 0 = all removed
foliage_frac_res(1) = 1.0
roots_frac_res(1)   = 1.0
rootcr_frac_res(1)  = 1.0
branch_frac_res(1)  = 1.0
stem_frac_res(1)    = 0.
! wood partitioning (fraction)
Crootcr_part(1) = 0.32 ! Coarse roots (Adegbidi et al 2005;
                       ! Black et al 2009; Morison et al 2012)
Cbranch_part(1) = 0.20 ! (Ares & Brauers 2005)
                       ! actually < 15 years branches = ~25 %
                       ! > 15 years branches = ~15 %.
! Csom loss due to physical removal with roots
! Morison et al (2012) Forestry Commission Research Note
soil_loss_frac(1) = 0.02 ! actually between 1-3 %
! was the forest burned after deforestation
post_harvest_burn(1) = 1.

!## scen 2
! harvest residue (fraction); 1 = all remains, 0 = all removed
foliage_frac_res(2) = 1.0
roots_frac_res(2)   = 1.0
rootcr_frac_res(2)  = 1.0
branch_frac_res(2)  = 1.0
stem_frac_res(2)    = 0.
! wood partitioning (fraction)
Crootcr_part(2) = 0.32 ! Coarse roots (Adegbidi et al 2005;
                       ! Black et al 2009; Morison et al 2012)
Cbranch_part(2) = 0.20 ! (Ares & Brauers 2005)
                       ! actually < 15 years branches = ~25 %
                       ! > 15 years branches = ~15 %.
! Csom loss due to physical removal with roots
! Morison et al (2012) Forestry Commission Research Note
soil_loss_frac(2) = 0.02 ! actually between 1-3 %
! was the forest burned after deforestation
post_harvest_burn(2) = 0.

!## scen 3
! harvest residue (fraction); 1 = all remains, 0 = all removed
foliage_frac_res(3) = 0.5
roots_frac_res(3)   = 1.0
rootcr_frac_res(3)  = 1.0
branch_frac_res(3)  = 0.
stem_frac_res(3)    = 0.
! wood partitioning (fraction)
Crootcr_part(3) = 0.32 ! Coarse roots (Adegbidi et al 2005;
                       ! Black et al 2009; Morison et al 2012)
Cbranch_part(3) = 0.20 ! (Ares & Brauers 2005)
                       ! actually < 15 years branches = ~25 %
                       ! > 15 years branches = ~15 %.
! Csom loss due to physical removal with roots
! Morison et al (2012) Forestry Commission Research Note
soil_loss_frac(3) = 0.02 ! actually between 1-3 %
! was the forest burned after deforestation
post_harvest_burn(3) = 0.


!## scen 4
! harvest residue (fraction); 1 = all remains, 0 = all removed
foliage_frac_res(4) = 0.5
roots_frac_res(4)   = 1.0
rootcr_frac_res(4)  = 0.
branch_frac_res(4)  = 0.
stem_frac_res(4)    = 0.
! wood partitioning (fraction)
Crootcr_part(4) = 0.32 ! Coarse roots (Adegbidi et al 2005;
                       ! Black et al 2009; Morison et al 2012)
Cbranch_part(4) = 0.20 ! (Ares & Brauers 2005)
                       ! actually < 15 years branches = ~25 %
                       ! > 15 years branches = ~15 %.
! Csom loss due to physical removal with roots
! Morison et al (2012) Forestry Commission Research Note
soil_loss_frac(4) = 0.02 ! actually between 1-3 %
! was the forest burned after deforestation
post_harvest_burn(4) = 0.

! for the moment override all partitioning parameters with those coming from
! CARDAMOM
Cbranch_part = pars(28)
Crootcr_part = pars(29)

! declare fire constants (labile, foliar, roots, wood, litter)
combust_eff(1) = 0.1 ; combust_eff(2) = 0.9
combust_eff(3) = 0.1 ; combust_eff(4) = 0.5
combust_eff(5) = 0.3 ; rfac = 0.5

! assign climate sensitivities
fol_turn_crit=pars(34)-1d0
lab_turn_crit=pars(3)-1d0
just_grown=pars(35)

! calculate some values once as these are invariant between DALEC runs
if (.not.allocated(tmp_x)) then
    ! 21 days is the maximum potential so we will fill the maximum potential
    ! + 1 for safety
    allocate(tmp_x(22),tmp_m(nodays))
    do f = 1, 22
       tmp_x(f) = f
    end do
    do n = 1, nodays
       ! calculate the gradient / trend of GSI
       if (sum(deltat(1:n)) < 21) then
           tmp_m(n) = n-1
       else
           ! else we will try and work out the gradient to see what is
           ! happening to the system over all. The default assumption will be to


           ! consider the averaging period of GSI model (i.e. 21 days). If this is not
           ! possible either the time step of the system is used (if step greater
           ! than 21 days) or all available steps (if n < 21).
           m = 0 ; test = 0
           do while (test < 21)
              test = sum(deltat((n-m):n))
              m=m+1
              if (m > (n-1)) test = 21
           end do
           tmp_m(n) = m
       end if ! for calculating gradient
    end do ! calc daily values once
    ! allocate GSI history dimension
    gsi_lag=max(2,maxval(nint(tmp_m)))
end if ! .not.allocated(tmp_x)
! assign our starting value
gsi_history = pars(36)-1d0

!
! Begin looping through each time step
!

do n = 1, nodays

   ! calculate LAI value
   lai(n)=POOLS(n,2)/pars(17)

   ! load next met / lai values for ACM
   gpppars(1)=lai(n)
   gpppars(2)=met(3,n) ! max temp
   gpppars(3)=met(2,n) ! min temp
   gpppars(5)=met(5,n) ! co2
   gpppars(6)=ceiling(met(6,n)-(deltat(n)*0.5)) ! doy
   gpppars(8)=met(4,n) ! radiation

   ! GPP (gC.m-2.day-1)
   if (lai(n) > 0.) then
       FLUXES(n,1) = acm(gpppars,constants)
   else
       FLUXES(n,1) = 0.
   end if
   ! temprate (i.e. temperature modified rate of metabolic activity)
   FLUXES(n,2) = exp(pars(10)*0.5*(met(3,n)+met(2,n)))
   ! autotrophic respiration (gC.m-2.day-1)
   FLUXES(n,3) = pars(2)*FLUXES(n,1)
   ! leaf production rate (gC.m-2.day-1)
   FLUXES(n,4) = 0. !(FLUXES(n,1)-FLUXES(n,3))*pars(3)
   ! labile production (gC.m-2.day-1)
   FLUXES(n,5) = (FLUXES(n,1)-FLUXES(n,3)-FLUXES(n,4))*pars(13)
   ! root production (gC.m-2.day-1)


   FLUXES(n,6) = (FLUXES(n,1)-FLUXES(n,3)-FLUXES(n,4)-FLUXES(n,5))*pars(4)
   ! wood production
   FLUXES(n,7) = FLUXES(n,1)-FLUXES(n,3)-FLUXES(n,4)-FLUXES(n,5)-FLUXES(n,6)

   ! GSI added to fortran version by TLS 24/11/2014
   ! /* 25/09/14 - JFE
   ! Here we calculate the Growing Season Index based on
   ! Jolly et al. A generalized, bioclimatic index to predict foliar
   ! phenology in response to climate. Global Change Biology, Volume 11,
   ! page 619-632, 2005 (doi: 10.1111/j.1365-2486.2005.00930.x)
   ! Stoeckli, R., T. Rutishauser, I. Baker, M. A. Liniger, and A. S.
   ! Denning (2011), A global reanalysis of vegetation phenology,
   ! J. Geophys. Res., 116, G03020, doi:10.1029/2010JG001545.

   ! It is the product of 3 limiting factors for temperature, photoperiod and
   ! vapour pressure deficit that grow linearly from 0 to 1 between a calibrated
   ! min and max value. Photoperiod, VPD and avgTmin are direct input

   ! temperature limitation, then restrict to 0-1; correction for K -> oC
   Tfac = (met(10,n)-(pars(14)-273.15)) / (pars(15)-pars(14))
   Tfac = min(1d0,max(0d0,Tfac))
   ! photoperiod limitation
   Photofac = (met(11,n)-pars(16)) / (pars(24)-pars(16))
   Photofac = min(1d0,max(0d0,Photofac))
   ! VPD limitation
   VPDfac = 1.0 - ( (met(12,n)-pars(25)) / (pars(26)-pars(25)) )
   VPDfac = min(1d0,max(0d0,VPDfac))

   ! calculate and store the GSI index
   FLUXES(n,18) = Tfac*Photofac*VPDfac

   ! Possible modifications to avoid new -> old when on upward trend:
   ! 3) GSI gradient
   ! These may require new parameter to consider the original condition?

   ! we will load up some needed variables
   m = tmp_m(n)
   ! update gsi_history for the calculation
   ! gsi_history((gsi_lag-m):gsi_lag) = FLUXES((n-m):n,18)

   if (n == 1) then
       ! in first step only we want to take the initial GSI value only
       gsi_history(gsi_lag) = FLUXES(n,18)
   else


       gsi_history((gsi_lag-m):gsi_lag) = FLUXES((n-m):n,18)
   end if
   ! calculate gradient
   gradient = linear_model_gradient(tmp_x(1:(gsi_lag)),gsi_history(1:gsi_lag),gsi_lag)

   ! adjust gradient to daily rate
   gradient = gradient / nint((sum(deltat((n-m+1):n))) / (gsi_lag-1))

   ! possible modification to the gradient / GSI combined model
   ! 1) foliar and labile turnovers calculated as usual but the gradient
   !    determines which process is allowed to go forward
   ! 2) foliar loss only, determined by gradient control

   ! first assume that nothing is happening
   FLUXES(n,9) = 0.0  ! leaf turnover
   FLUXES(n,16) = 0.0 ! leaf growth

   ! now update foliage and labile conditions based on gradient calculations
   if (gradient < fol_turn_crit .or. FLUXES(n,18) == 0) then
       ! we are in a descending condition so foliar turnover
       FLUXES(n,9) = pars(5)*(1.0-FLUXES(n,18))
       just_grown = 0.5
   else if (gradient > lab_turn_crit) then
       ! we are in an ascending condition so labile turnover
       FLUXES(n,16) = pars(12)*FLUXES(n,18)
       just_grown = 1.5
       ! check carbon return
       tmp = POOLS(n,1)*(1d0-(1d0-FLUXES(n,16))**deltat(n))/deltat(n)
       tmp = (POOLS(n,2)+tmp)/pars(17)
       gpppars(1)=tmp
       tmp = acm(gpppars,constants)
       ! determine if increase in LAI leads to an improvement in GPP greater
       ! than critical value, if not then no labile turnover allowed
       if ( ((tmp - FLUXES(n,1))/FLUXES(n,1)) < pars(27) ) then
           FLUXES(n,16) = 0d0
       end if
   else
       ! probably we want nothing to happen, however if we are at the seasonal
       ! maximum we will consider further growth still
       if (just_grown >= 1.0) then
           ! we are between so definitely not losing foliage and we have
           ! previously been growing so maybe we still have a marginal return on
           ! doing so again
           FLUXES(n,16) = pars(12)*FLUXES(n,18)


           ! but possibly gaining some?
           ! determine if this is a good idea based on GPP increment
           tmp = POOLS(n,1)*(1d0-(1d0-FLUXES(n,16))**deltat(n))/deltat(n)
           tmp = (POOLS(n,2)+tmp)/pars(17)
           gpppars(1)=tmp
           tmp = acm(gpppars,constants)
           ! determine if increase in LAI leads to an improvement in GPP greater
           ! than critical value, if not then no labile turnover allowed
           if ( ((tmp - FLUXES(n,1))/FLUXES(n,1)) < pars(27) ) then
               FLUXES(n,16) = 0d0
           end if
       end if ! Just grown?
   end if ! gradient choice

   ! these allocated if post-processing
   if (allocated(itemp)) then
       itemp(n) = Tfac
       ivpd(n) = VPDfac
       iphoto(n) = Photofac
   end if

   !
   ! those with time dependencies
   !

   ! total labile release
   FLUXES(n,8) = POOLS(n,1)*(1.-(1.-FLUXES(n,16))**deltat(n))/deltat(n)
   ! total leaf litter production
   FLUXES(n,10) = POOLS(n,2)*(1.-(1.-FLUXES(n,9))**deltat(n))/deltat(n)
   ! total wood production
   FLUXES(n,11) = POOLS(n,4)*(1.-(1.-pars(6))**deltat(n))/deltat(n)
   ! total root litter production
   FLUXES(n,12) = POOLS(n,3)*(1.-(1.-pars(7))**deltat(n))/deltat(n)

   !
   ! those with temperature AND time dependencies
   !

   ! respiration heterotrophic litter
   FLUXES(n,13) = POOLS(n,5)*(1.-(1.-FLUXES(n,2)*pars(8))**deltat(n))/deltat(n)
   ! respiration heterotrophic som
   FLUXES(n,14) = POOLS(n,6)*(1.-(1.-FLUXES(n,2)*pars(9))**deltat(n))/deltat(n)
   ! litter to som
   FLUXES(n,15) = POOLS(n,5)*(1.-(1.-pars(1)*FLUXES(n,2))**deltat(n))/deltat(n)

   ! calculate the NEE
   NEE(n) = (-FLUXES(n,1)+FLUXES(n,3)+FLUXES(n,13)+FLUXES(n,14))
   ! load GPP


   GPP(n) = FLUXES(n,1)

   !
   ! update pools for next timestep
   !

   ! labile pool
   POOLS(n+1,1) = POOLS(n,1) + (FLUXES(n,5)-FLUXES(n,8))*deltat(n)
   ! foliar pool
   POOLS(n+1,2) = POOLS(n,2) + (FLUXES(n,4)-FLUXES(n,10)+FLUXES(n,8))*deltat(n)
   ! wood pool
   POOLS(n+1,4) = POOLS(n,4) + (FLUXES(n,7)-FLUXES(n,11))*deltat(n)
   ! root pool
   POOLS(n+1,3) = POOLS(n,3) + (FLUXES(n,6)-FLUXES(n,12))*deltat(n)
   ! litter pool
   POOLS(n+1,5) = POOLS(n,5) + (FLUXES(n,10)+FLUXES(n,12)-FLUXES(n,13)-FLUXES(n,15))*deltat(n)
   ! som pool
   POOLS(n+1,6) = POOLS(n,6) + (FLUXES(n,15)-FLUXES(n,14)+FLUXES(n,11))*deltat(n)

   !
   ! deal first with deforestation
   !

   if (n == reforest_day) then
       POOLS(n+1,1) = pars(30)
       POOLS(n+1,2) = pars(31)
       POOLS(n+1,3) = pars(32)
       POOLS(n+1,4) = pars(33)
   end if

   if (met(8,n) > 0.) then

       ! pass harvest management to local integer
       harvest_management = int(met(13,n))

       ! assume that labile is proportionally distributed through the plant
       ! and therefore so is the residual fraction
       C_total = POOLS(n+1,2) + POOLS(n+1,3) + POOLS(n+1,4)
       ! partition wood into its components
       Cbranch = POOLS(n+1,4)*Cbranch_part(harvest_management)
       Crootcr = POOLS(n+1,4)*Crootcr_part(harvest_management)
       Cstem   = POOLS(n+1,4)-(Cbranch + Crootcr)
       ! now calculate the labile fraction of residue
       labile_frac_res = ( (POOLS(n+1,2)/C_total) * foliage_frac_res(harvest_management) ) &
                       + ( (POOLS(n+1,3)/C_total) * roots_frac_res(harvest_management)   ) &
                       + ( (Cbranch/C_total)      * branch_frac_res(harvest_management)  ) &
                       + ( (Cstem/C_total)        * stem_frac_res(harvest_management)    ) &


                       + ( (Crootcr/C_total)      * rootcr_frac_res(harvest_management)  )

       ! loss of carbon from each pool
       labile_loss = POOLS(n+1,1)*met(8,n)
       foliar_loss = POOLS(n+1,2)*met(8,n)
       roots_loss  = POOLS(n+1,3)*met(8,n)
       wood_loss   = POOLS(n+1,4)*met(8,n)
       ! transfer fraction of harvest waste to litter or som pools
       ! easy pools first
       labile_residue = POOLS(n+1,1)*met(8,n)*labile_frac_res
       foliar_residue = POOLS(n+1,2)*met(8,n)*foliage_frac_res(harvest_management)
       roots_residue  = POOLS(n+1,3)*met(8,n)*roots_frac_res(harvest_management)
       ! explicit calculation of the residues from each fraction
       coarse_root_residue = Crootcr*met(8,n)*rootcr_frac_res(harvest_management)
       branch_residue = Cbranch*met(8,n)*branch_frac_res(harvest_management)
       stem_residue   = Cstem*met(8,n)*stem_frac_res(harvest_management)
       ! now finally calculate the final wood residue
       wood_residue = stem_residue + branch_residue + coarse_root_residue
       ! mechanical loss of Csom due to coarse root extraction
       soil_loss_with_roots = Crootcr*met(8,n)*(1.-rootcr_frac_res(harvest_management)) &
                            * soil_loss_frac(harvest_management)

       ! update living pools directly
       POOLS(n+1,1) = max(0.,POOLS(n+1,1)-labile_loss)
       POOLS(n+1,2) = max(0.,POOLS(n+1,2)-foliar_loss)
       POOLS(n+1,3) = max(0.,POOLS(n+1,3)-roots_loss)
       POOLS(n+1,4) = max(0.,POOLS(n+1,4)-wood_loss)
       ! then work out the adjustment due to burning if there is any
       if (post_harvest_burn(harvest_management) > 0.) then
           !/*first fluxes*/
           !/*LABILE*/
           CFF(1) = POOLS(n+1,1)*post_harvest_burn(harvest_management)*combust_eff(1)
           NCFF(1) = POOLS(n+1,1)*post_harvest_burn(harvest_management)*(1-combust_eff(1))*(1-rfac)
           CFF_res(1) = labile_residue*post_harvest_burn(harvest_management)*combust_eff(1)
           NCFF_res(1) = labile_residue*post_harvest_burn(harvest_management)*(1-combust_eff(1))*(1-rfac)
           !/*foliar*/
           CFF(2) = POOLS(n+1,2)*post_harvest_burn(harvest_management)*combust_eff(2)
           NCFF(2) = POOLS(n+1,2)*post_harvest_burn(harvest_management)*(1-combust_eff(2))*(1-rfac)


           CFF_res(2) = foliar_residue*post_harvest_burn(harvest_management)*combust_eff(2)
           NCFF_res(2) = foliar_residue*post_harvest_burn(harvest_management)*(1-combust_eff(2))*(1-rfac)
           !/*root*/
           CFF(3) = 0. !POOLS(n+1,3)*post_harvest_burn(harvest_management)*combust_eff(3)
           NCFF(3) = 0. !POOLS(n+1,3)*post_harvest_burn(harvest_management)*(1-combust_eff(3))*(1-rfac)
           CFF_res(3) = 0. !roots_residue*post_harvest_burn(harvest_management)*combust_eff(3)
           NCFF_res(3) = 0. !roots_residue*post_harvest_burn(harvest_management)*(1-combust_eff(3))*(1-rfac)
           !/*wood*/
           CFF(4) = POOLS(n+1,4)*post_harvest_burn(harvest_management)*combust_eff(4)
           NCFF(4) = POOLS(n+1,4)*post_harvest_burn(harvest_management)*(1-combust_eff(4))*(1-rfac)
           CFF_res(4) = wood_residue*post_harvest_burn(harvest_management)*combust_eff(4)
           NCFF_res(4) = wood_residue*post_harvest_burn(harvest_management)*(1-combust_eff(4))*(1-rfac)
           !/*litter*/
           CFF(5) = POOLS(n+1,5)*post_harvest_burn(harvest_management)*combust_eff(5)
           NCFF(5) = POOLS(n+1,5)*post_harvest_burn(harvest_management)*(1-combust_eff(5))*(1-rfac)
           !/*fires as daily averages to comply with units*/
           FLUXES(n,17)=(CFF(1)+CFF(2)+CFF(3)+CFF(4)+CFF(5) &
                        +CFF_res(1)+CFF_res(2)+CFF_res(3)+CFF_res(4))/deltat(n)
           ! update the residue terms
           labile_residue = labile_residue - CFF_res(1) - NCFF_res(1)
           foliar_residue = foliar_residue - CFF_res(2) - NCFF_res(2)
           roots_residue  = roots_residue  - CFF_res(3) - NCFF_res(3)
           wood_residue   = wood_residue   - CFF_res(4) - NCFF_res(4)
           ! now update NEE
           NEE(n)=NEE(n)+FLUXES(n,17)
       else
           FLUXES(n,17) = 0.
           CFF = 0.     ; NCFF = 0.
           CFF_res = 0. ; NCFF_res = 0.
       end if
       ! update all pools this time
       POOLS(n+1,1) = max(0., POOLS(n+1,1) - CFF(1) - NCFF(1) )
       POOLS(n+1,2) = max(0., POOLS(n+1,2) - CFF(2) - NCFF(2) )


       POOLS(n+1,3) = max(0., POOLS(n+1,3) - CFF(3) - NCFF(3) )
       POOLS(n+1,4) = max(0., POOLS(n+1,4) - CFF(4) - NCFF(4) )
       POOLS(n+1,5) = max(0., POOLS(n+1,5) + (labile_residue+foliar_residue+roots_residue) &
                                           + (NCFF(1)+NCFF(2)+NCFF(3)) )
       POOLS(n+1,6) = max(0., POOLS(n+1,6) + (wood_residue-soil_loss_with_roots) &
                                           + (NCFF(4)+NCFF(5)) )
       ! this is intended for use with the R interface for subsequent post
       ! processing
       if (allocated(extracted_C)) then
           ! harvested carbon from all pools
           extracted_C(n) = (wood_loss-(wood_residue+CFF_res(4)+NCFF_res(4)))     &
                          + (labile_loss-(labile_residue+CFF_res(1)+NCFF_res(1))) &
                          + (foliar_loss-(foliar_residue+CFF_res(2)+NCFF_res(2))) &
                          + (roots_loss-(roots_residue+CFF_res(3)+NCFF_res(3)))
       end if ! allocated extracted_C
       ! total carbon loss from the system
       C_total = (labile_residue+foliar_residue+roots_residue+wood_residue+sum(NCFF)) &
               - (labile_loss+foliar_loss+roots_loss+wood_loss+soil_loss_with_roots+sum(CFF))

       ! if total clearance occurred then we need to ensure some minimum
       ! values and reforestation is assumed one year forward
       if (met(8,n) > 0.99) then
           m=0 ; test=sum(deltat(n:(n+m)))
           ! FC Forest Statistics 2015 lag between harvest and restocking ~ 2 year
           restocking_lag = 365*2
           do while (test < restocking_lag)
              m=m+1 ; test = sum(deltat(n:(n+m)))
              ! get out clause for hitting the end of the simulation
              if (m+n >= nodays) test = restocking_lag
           end do
           reforest_day = min((n+m), nodays)
       end if ! if total clearance

   end if ! end deforestation info

   !
   ! then deal with fire
   !

   if (met(9,n) > 0.) then

       !/*first fluxes*/
       !/*LABILE*/
       CFF(1) = POOLS(n+1,1)*met(9,n)*combust_eff(1)
       NCFF(1) = POOLS(n+1,1)*met(9,n)*(1-combust_eff(1))*(1-rfac)
       !/*foliar*/
       CFF(2) = POOLS(n+1,2)*met(9,n)*combust_eff(2)


       NCFF(2) = POOLS(n+1,2)*met(9,n)*(1-combust_eff(2))*(1-rfac)
       !/*root*/
       CFF(3) = 0. ! POOLS(n+1,3)*met(9,n)*combust_eff(3)
       NCFF(3) = 0. ! POOLS(n+1,3)*met(9,n)*(1-combust_eff(3))*(1-rfac)
       !/*wood*/
       CFF(4) = POOLS(n+1,4)*met(9,n)*combust_eff(4)
       NCFF(4) = POOLS(n+1,4)*met(9,n)*(1-combust_eff(4))*(1-rfac)
       !/*litter*/
       CFF(5) = POOLS(n+1,5)*met(9,n)*combust_eff(5)
       NCFF(5) = POOLS(n+1,5)*met(9,n)*(1-combust_eff(5))*(1-rfac)
       !/*fires as daily averages to comply with units*/
       FLUXES(n,17)=(CFF(1)+CFF(2)+CFF(3)+CFF(4)+CFF(5))/deltat(n)

       !/*all fluxes are at a daily timestep*/
       NEE(n)=NEE(n)+FLUXES(n,17)

       !// update pools
       !/*Adding all fire pool transfers here*/
       POOLS(n+1,1)=POOLS(n+1,1)-CFF(1)-NCFF(1)
       POOLS(n+1,2)=POOLS(n+1,2)-CFF(2)-NCFF(2)
       POOLS(n+1,3)=POOLS(n+1,3)-CFF(3)-NCFF(3)
       POOLS(n+1,4)=POOLS(n+1,4)-CFF(4)-NCFF(4)
       POOLS(n+1,5)=POOLS(n+1,5)-CFF(5)-NCFF(5)+NCFF(1)+NCFF(2)+NCFF(3)
       POOLS(n+1,6)=POOLS(n+1,6)+NCFF(4)+NCFF(5)

   end if ! end burnt area issues

   ! do nxp = 1, nopools
   !    if (POOLS(n+1,nxp) /= POOLS(n+1,nxp)) then
   !        print*,"step",n, nxp
   !        print*,"met",met(:,n)
   !        print*,"POOLS",POOLS(n,:)
   !        print*,"FLUXES",FLUXES(n,:)
   !        stop
   !    endif
   ! enddo

end do ! nodays loop
! deallocate(tmp_x, tmp_m)

end subroutine CARBON_MODEL
!
!--------------------------------------------------------------------
!
double precision function acm(drivers,constants)

! The Aggregated Canopy Model is a Gross Primary Productivity (i.e.
! photosynthesis) emulator which operates at a daily time step. ACM can be
! parameterised to provide reasonable results for most ecosystems.

implicit none

! declare input variables


double precision, intent(in) :: drivers(12)   & ! acm input requirements
                               ,constants(10)   ! ACM parameters

! declare local variables
double precision :: gc, pn, pd, pp, qq, ci, e0, dayl, cps, dec, nit    &
                   ,trange, sinld, cosld, aob, mult                    &
                   ,mint,maxt,radiation,co2,lai,doy,lat                &
                   ,deltaWP,Rtot,NUE,temp_exponent,dayl_coef           &
                   ,dayl_const,hydraulic_exponent,hydraulic_temp_coef  &
                   ,co2_comp_point,co2_half_sat,lai_coef,lai_const

! initial values
gc=0.0 ; pp=0.0 ; qq=0.0 ; ci=0.0 ; e0=0.0 ; dayl=0.0 ; cps=0.0 ; dec=0.0 ; nit=1.0

! load driver values to correct local vars
lai = drivers(1)
maxt = drivers(2)
mint = drivers(3)
nit = drivers(4)
co2 = drivers(5)
doy = drivers(6)
radiation = drivers(8)
lat = drivers(7)

! load parameters into correct local vars
! pi = drivers(11)
deltaWP = drivers(9)
Rtot = drivers(10)
NUE = constants(1)
dayl_coef = constants(2)
co2_comp_point = constants(3)
co2_half_sat = constants(4)
dayl_const = constants(5)
hydraulic_temp_coef = constants(6)
lai_coef = constants(7)
temp_exponent = constants(8)
lai_const = constants(9)
hydraulic_exponent = constants(10)

! determine temperature range
trange=0.5*(maxt-mint)
! daily canopy conductance, of CO2 or H2O?
gc=abs(deltaWP)**(hydraulic_exponent)/((hydraulic_temp_coef*Rtot+trange))
! maximum rate of temperature and nitrogen (canopy efficiency) limited
! photosynthesis (gC.m-2.day-1)
pn=lai*nit*NUE*exp(temp_exponent*maxt)
! pp and qq represent limitation by diffusion and metabolites respectively
pp=pn/gc ; qq=co2_comp_point-co2_half_sat
! calculate internal CO2 concentration (ppm)
ci=0.5*(co2+qq-pp+sqrt((co2+qq-pp)**2-4.0*(co2*qq-pp*co2_comp_point)))
! limit maximum quantum efficiency by leaf area, hyperbola


e0=lai_coef*(lai*lai)/((lai*lai)+lai_const) ! calculate day length (hours) ! This is the old REFLEX project calculation but it is wrong so anyway here ! we go... dec=-23.4*cos((360*(doy+10.)/365)*deg_to_rad)*deg_to_rad mult=tan(lat*deg_to_rad)*tan(dec) if (mult>=1) then dayl=24.0 else if (mult<=-1) then dayl=0. else dayl=24.*acos(-mult)/pi end if

! ------------------------------------------------------------------
! calculate CO2 limited rate of photosynthesis
pd=gc*(co2-ci)
! calculate combined light and CO2 limited photosynthesis
cps=e0*radiation*pd/(e0*radiation+pd)
! correct for day length variation
acm=cps*(dayl_coef*dayl+dayl_const)

! don't forget to return
return

end function acm
!
!------------------------------------------------------------------
!
double precision function linear_model_gradient(x,y,interval)

! Function to calculate the gradient of a linear model for a given dependent
! variable (y) based on predictive variable (x). The typical use of this
! function will in fact be to assume that x is time.

implicit none

! declare input variables
integer :: interval
double precision, dimension(interval) :: x,y

! declare local variables
double precision :: sum_x, sum_y, sumsq_x, sum_product_xy

! calculate the sum of x
sum_x = sum(x)
! calculate the sum of y
sum_y = sum(y)
! calculate the sum of squares of x
sumsq_x = sum(x*x)
! calculate the sum of the product of xy
sum_product_xy = sum(x*y)
! calculate the gradient


linear_model_gradient = ( (interval*sum_product_xy) - (sum_x*sum_y) ) &
                      / ( (interval*sumsq_x) - (sum_x*sum_x) )

! for future reference here is how to calculate the intercept
! intercept = ( (sum_y*sumsq_x) - (sum_x*sum_product_xy) ) &
!           / ( (interval*sumsq_x) - (sum_x*sum_x) )

! don't forget to return to the user
return

end function linear_model_gradient
!
!------------------------------------------------------------------
!
!
!------------------------------------------------------------------
!
end module CARBON_MODEl_MOD
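As a quick sanity check of the closed-form least-squares gradient used in linear_model_gradient, the same computation can be sketched in Python (an illustrative re-implementation, not part of CARDAMOM); points sampled from a known line should recover its slope exactly:

```python
# Ordinary least-squares gradient, mirroring the Fortran function above:
# slope = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - sum(x)^2)
def linear_model_gradient(x, y):
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sumsq_x = sum(xi * xi for xi in x)
    sum_product_xy = sum(xi * yi for xi, yi in zip(x, y))
    return (n * sum_product_xy - sum_x * sum_y) / (n * sumsq_x - sum_x * sum_x)

# Points on the line y = 2x + 1 should give a gradient of exactly 2.
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0 * xi + 1.0 for xi in x]
print(linear_model_gradient(x, y))  # → 2.0
```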


APPENDIX B – SCRIPTS USED TO RUN BENCHMARKS

SHELL SCRIPT USED TO RUN BENCHMARKS
#!/bin/bash
# Outer directory
for OD in *; do
    if [ -d "${OD}" ]; then
        cd ${OD}
        # loop on projects
        for D in *; do
            if [ -d "${D}" ]; then
                #solutionsRequested=1000
                solutionsRequested=1000000
                echo "Starting Processing " ${D};
                echo "Starting Make FOR: " ${D};
                # Build project
                cd ${D} && make clean && make && cd ..;
                DBP=`pwd`"/${D}"; # dir being processed
                SECONDS=0
                echo "Calling CARDAMOM For: " ${D};
                # Execute 4 times
                for y in {0..3}; do
                    # multiply the current number by the number of cores available
                    # to execute 16 concurrent runs
                    curOuter=$(($y*16))
                    for x in {1..16}; do
                        cur=$(($curOuter + $x))
                        (time ${DBP}/CARDAMOM_OUTPUTS/DALEC_GSI_DFOL_FR_MHMCMC/UK_all_forestry_withmpi_obs/EXECUTABLE/cardamom.exe \
                              ${DBP}/CARDAMOM_OUTPUTS/DALEC_GSI_DFOL_FR_MHMCMC/UK_all_forestry_withmpi_obs/DATA/UK_all_forestry_withmpi_obs_05850.bin \
                              ${DBP}/CARDAMOM_OUTPUTS/DALEC_GSI_DFOL_FR_MHMCMC/UK_all_forestry_withmpi_obs/RESULTS/UK_all_forestry_withmpi_obs_05850_${cur}_ \
                              $solutionsRequested 0 1000 > ${DBP}/outs/out${cur}.txt ) > ${DBP}/outs/perf${cur}.txt 2>&1 &
                    done
                    wait
                done
                echo "Finished CARDAMOM FOR: " ${D} "In " $SECONDS "seconds";
            fi
        done
        cd ..;
    fi
done


SCRIPT USED TO COLLECT RESULTS
from os import listdir
from os.path import isfile, join, isdir
from datetime import datetime, timedelta
import csv
import re

myDir = "."
onlyDir = [d for d in listdir(myDir) if isdir(d)]
with open('Results.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',', quotechar='|',
                            quoting=csv.QUOTE_MINIMAL)
    for folder in onlyDir:
        mypath = join(folder, "outs")
        onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
        regex = re.compile("user (\d*m\d*\.\d*)")
        myarray = []
        totalAllRuns = 0.0
        filesRead = 0
        max = 0
        min = 999999999
        fileFailed = 0
        for file in onlyfiles:
            curRun = 0
            # read file
            filepath = join(mypath, file)
            file = open(filepath)
            data = file.read()
            matches = regex.findall(data)
            if len(matches) > 0:
                curRun = datetime.strptime(matches[0], "%Mm%S.%f")
                curRun = timedelta(minutes=curRun.minute, seconds=curRun.second,
                                   microseconds=curRun.microsecond)
                curRun = curRun.total_seconds()
            if curRun > 0:
                filesRead += 1
                myarray.append(curRun)
                totalAllRuns = totalAllRuns + curRun
                print str(filepath) + " Took " + str(curRun)
                if (curRun > max):
                    max = curRun
                if (curRun < min):
                    min = curRun
            else:
                fileFailed += 1

        print "AVG " + str((totalAllRuns / filesRead))
        print "MIN: " + str(min)
        print "MAX: " + str(max)
        print ""
        print "Total:" + str(totalAllRuns)
        if (fileFailed):
            print(str(fileFailed) + " files failed processing")
        print "END RESULTS FOR " + str(folder)
        print "------"
        print ""
        print ""
        print ""
        print ""
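The collection script keys off the "user" line emitted by the shell's time builtin; as an illustrative check of the regex and strptime conversion (the sample timing string below is made up), the extraction works as follows:

```python
import re
from datetime import datetime, timedelta

# A made-up `time` report of the kind the collection script parses.
sample = "real 6m45.12s\nuser 6m31.27s\nsys 0m2.11s"

# Same pattern as the script: capture the minutes/seconds part of the user time.
regex = re.compile(r"user (\d*m\d*\.\d*)")
match = regex.findall(sample)[0]            # "6m31.27"

# Parse "MmS.f" and convert to total seconds, as the script does.
t = datetime.strptime(match, "%Mm%S.%f")
seconds = timedelta(minutes=t.minute, seconds=t.second,
                    microseconds=t.microsecond).total_seconds()
print(seconds)  # → 391.27
```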


APPENDIX C – SIMPLE UNIT TEST

@test
subroutine CARDAMOM()
    use CARBON_MODEL_MOD
    use pfunit_mod
    implicit none
    integer :: n,p
    integer :: nopars   & ! number of parameters in vector
              ,nomet    & ! number of meteorological fields
              ,nofluxes & ! number of model fluxes
              ,nopools  & ! number of model pools
              ,nodays     ! number of days in simulation

    double precision :: met(14,120) & ! met drivers
                       ,deltat(120) & ! time step in decimal days
                       ,pars(36)    & ! number of parameters
                       ,lat           ! site latitude (degrees)

    double precision, dimension(120) :: lai & ! leaf area index
                                       ,GPP & ! Gross primary productivity
                                       ,NEE   ! net ecosystem exchange of CO2

double precision, dimension((120+1),6):: POOLS ! vector of ecosystem pools

    double precision, dimension(120,18) :: FLUXES ! vector of ecosystem fluxes

    !!! VARIABLES FOR EXPECTED OUTPUT !!!
    integer :: Onopars   & ! number of parameters in vector
              ,Onomet    & ! number of meteorological fields
              ,Onofluxes & ! number of model fluxes
              ,Onopools  & ! number of model pools
              ,Onodays     ! number of days in simulation

    double precision :: Omet(14,120) & ! met drivers
                       ,Odeltat(120) & ! time step in decimal days
                       ,Opars(36)    & ! number of parameters
                       ,Olat           ! site latitude (degrees)

    double precision, dimension(120) :: Olai & ! leaf area index
                                       ,OGPP & ! Gross primary productivity
                                       ,ONEE   ! net ecosystem exchange of CO2

double precision, dimension((120+1),6):: OPOOLS ! vector of ecosystem pools

    double precision, dimension(120,18) :: OFLUXES ! vector of ecosystem fluxes

    PRINT *, ("STARTING READING ...")
    open(unit=102, file="inputToCARBONMODEL.txt")
    read(102,*)nodays
    PRINT *, ("days ..."), nodays

    read(102,*)nopars
    PRINT *, ("pars ..."), nopars
    read(102,*)nomet
    read(102,*)nopools
    read(102,*)nofluxes
    read(102,*)met
    read(102,*)pars
    read(102,*)deltat
    read(102,*)lat
    read(102,*)lai
    read(102,*)NEE
    read(102,*)FLUXES
    read(102,*)POOLS
    read(102,*)GPP
    close(102)

    call CARBON_MODEL(met,pars,deltat,nodays,lat,lai,NEE,FLUXES,POOLS &
                     ,nopars,nomet,nopools,nofluxes,GPP)

    open(unit=102, file="OUTPUT_FROM_CARBONMODEL.txt")
    read(102,*)Onodays
    read(102,*)Onopars
    read(102,*)Onomet
    read(102,*)Onopools
    read(102,*)Onofluxes
    read(102,*)Omet
    read(102,*)Opars
    read(102,*)Odeltat
    read(102,*)Olat
    read(102,*)Olai
    read(102,*)ONEE
    read(102,*)OFLUXES
    read(102,*)OPOOLS
    read(102,*)OGPP
    close(102)

    !loop On days, For Pools
    do n=1, nodays+1
        do p=1, noPools
            if(isnan(OPOOLS(n,p))) then
                @assertIsNaN(POOLS(n,p))
            else
                @assertEqual(OPOOLS(n,p), POOLS(n,p), 0.5d-9, "failed on pools")
            end if
        end do
    end do

    !loop On days, For Fluxes
    do n=1, nodays
        do p=1, noFluxes
            if(isnan(OFluxes(n,p))) then
                @assertIsNaN(Fluxes(n,p))
            else
                @assertEqual(OFluxes(n,p), Fluxes(n,p), 0.5d-9, "failed on fluxes")
            end if
        end do
    end do

    !Test Lai
    do n=1, nodays
        if(isnan(OLai(n))) then
            @assertIsNaN(Lai(n))
        else
            @assertEqual(OLai(n), Lai(n), 0.5d-9, "failed on LAI")
        end if
    end do

    !Test GPP
    do n=1, nodays
        if(isnan(OGPP(n))) then
            @assertIsNaN(GPP(n))
        else
            @assertEqual(OGPP(n), GPP(n), 0.5d-9, "failed on GPP")
        end if
    end do

    !Test NEE
    do n=1, nodays
        if(isnan(ONEE(n))) then
            @assertIsNaN(NEE(n))
        else
            @assertEqual(ONEE(n), NEE(n), 0.5d-9, "failed on NEE")
        end if
    end do
end subroutine CARDAMOM
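The NaN guards in the comparison loops above are needed because NaN compares unequal to everything, including itself, so a plain tolerance check would always fail on NaN entries. The same pattern, sketched in Python with a hypothetical assert_close helper:

```python
import math

def assert_close(expected, actual, tol=0.5e-9):
    """NaN-aware comparison mirroring the pFUnit checks in the test above:
    a NaN expectation is satisfied only by a NaN actual value; otherwise
    the values must agree to within the absolute tolerance."""
    if math.isnan(expected):
        assert math.isnan(actual), "expected NaN"
    else:
        assert abs(expected - actual) <= tol, "values differ beyond tolerance"

assert_close(float("nan"), float("nan"))  # passes: NaN matches NaN
assert_close(1.0, 1.0 + 1e-10)            # passes: within 0.5e-9
```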


APPENDIX D – OTHER RESULTS

RESULTS WHICH WERE NOT COLLECTED WITH EDDIE
The following results were collected on a desktop computer with an Intel Core i7-5820K (Haswell) processor, which has 6 cores running at 3.30 GHz, a 32 KB L1 data cache and a 32 KB L1 instruction cache per core, 256 KB of L2 cache per core, and a shared 15 MB L3 cache. Hyper-Threading was disabled during the tests.

The results collected are of 5 concurrent runs requesting 1,000,000 samples with the UK Forestry sample file.


[Bar chart: average time taken (s) for 15 Intel Fortran compilation strategies (random seed), covering -O0, -O2/-O3 with -xhost, -ipo, -ipo-c and -no-ftz variants, profile-guided builds (generated from -O0 and -O2), -fast and -Ofast; reported averages ranged from about 313 s to about 1104 s.]

Figure D-1 Benchmark of different compilation strategies of the Intel Fortran Compiler; the reported average is of 5 concurrent runs requesting 1,000,000 samples with the UK forestry sample file; tests were run on a desktop PC running Linux with an Intel Haswell Core i7-5820K processor

[Bar chart: average time taken (s) for eight GNU Fortran compilation strategies (-O0, -O2, -O3, -Os, -Ofast and -march="core-avx2" variants with -ffast-math, -funroll-loops and -funsafe-math-optimizations), for both random-seed and single-seed runs; random-seed averages ranged from about 681 s to about 1293 s, single-seed averages from about 785 s to about 1125 s.]

Figure D-2 Benchmark of different compilation strategies of the GNU Fortran Compiler; the reported average is of 5 concurrent runs requesting 1,000,000 samples with the UK forestry sample file; tests were run on a desktop PC running Linux with an Intel Haswell Core i7-5820K processor


RESULTS COLLECTED WITH EDDIE
The following results were collected with the same procedure described in the research methods section, but were not considered of primary interest within the main document.

Average time taken (s):

Compilation strategy             Random Seed   Single Seed
O3-xhost-ipo-no-ftzSingleFile    596.11        731.22
O3-xhost-ipoSingleFile           694.52        749.20
O3-xhost-ipo                     702.40        569.37
O3-xhost-ipo-no-ftz              703.85        618.12

Figure D-3 Different compilation strategies based on O3 optimisation level for the Intel Fortran compiler

BIBLIOGRAPHY

[1] Y. Luo, “Terrestrial Carbon–Cycle Feedback to Climate Warming,” Annual Review of Ecology, Evolution, and Systematics, 2007. [Online]. Available: http://www.jstor.org/stable/30033876?seq=1#page_scan_tab_contents. [Accessed: 24-May-2016].
[2] G. B. Bonan, “Forests and climate change: forcings, feedbacks, and the climate benefits of forests,” Science, vol. 320, no. 5882, pp. 1444–1449, Jun. 2008.
[3] A. A. Bloom and M. Williams, “Constraining ecosystem carbon dynamics in a data-limited world: integrating ecological ‘common sense’ in a model–data fusion framework,” Biogeosciences, vol. 12, no. 5, pp. 1299–1315, Mar. 2015.
[4] Y. Luo, K. Ogle, C. Tucker, S. Fei, C. Gao, S. LaDeau, J. S. Clark, and D. S. Schimel, “Ecological forecasting and data assimilation in a data-rich era,” Ecol. Appl., vol. 21, no. 5, pp. 1429–1442, Jul. 2011.
[5] S. Thorn, “Eddie,” Eddie Confluence Pages, 2016. [Online]. Available: https://www.wiki.ed.ac.uk/display/ResearchServices/Eddie. [Accessed: 07-Jun-2016].
[6] S. Thorn, “Memory Specification,” Eddie Confluence Pages, 2016. [Online]. Available: https://www.wiki.ed.ac.uk/display/ResearchServices/Memory+Specification.
[7] P. Hammarlund, A. J. Martinez, A. A. Bajwa, D. L. Hill, E. Hallnor, H. Jiang, M. Dixon, M. Derr, M. Hunsaker, R. Kumar, R. B. Osborne, R. Rajwar, R. Singhal, R. D’Sa, R. Chappell, S. Kaushik, S. Chennupaty, S. Jourdan, S. Gunther, T. Piazza, and T. Burton, “Haswell: The fourth-generation Intel Core processor,” IEEE Micro, vol. 34, no. 2, pp. 6–20, 2014.
[8] K. Taylor (Intel), “What exactly is a P-state? (Pt. 1).” [Online]. Available: https://software.intel.com/en-us/blogs/2008/05/29/what-exactly-is-a-p-state-pt-1. [Accessed: 08-Jun-2016].
[9] Intel®, “Intel® Xeon® Processor E5-2630 v3 (20M Cache, 2.40 GHz).” [Online]. Available: http://ark.intel.com/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz. [Accessed: 08-Jun-2016].
[10] Intel®, “Intel® Turbo Boost Technology 2.0: Higher Performance When You Need It Most.” [Online]. Available: http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html. [Accessed: 08-Jun-2016].
[11] Intel Corporation, “Intel Xeon Processor E5 v2 Product Family Processor,” pp. 9–10, Feb. 2016.
[12] M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, “Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero.”
[13] “Using Automatic Vectorization | Intel® Software.” [Online]. Available: https://software.intel.com/en-us/node/522572.
[14] Intel, “Intel 64 and IA-32 Architectures Optimization Reference Manual,” Jun. 2016.


[15] “Profile-Guided Optimizations Overview | Intel® Software.” [Online]. Available: https://software.intel.com/en-us/node/524794.
[16] Agner Fog (Technical University of Denmark), Optimizing Software in C++: An Optimization Guide for Windows, Linux and Mac Platforms. Denmark, 2015.
[17] Agner Fog (Technical University of Denmark), “4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs.” Denmark, pp. 185–200, 2016.
[18] S. Goedecker and A. Hoisie, Performance Optimization of Numerically Intensive Codes. 2001.
[19] “Compiler Option Categories and Descriptions | Intel® Software.” [Online]. Available: https://software.intel.com/en-us/node/524884.
[20] “O | Intel® Software,” Intel Fortran Compiler Documentation, 2016. [Online]. Available: https://software.intel.com/en-us/node/524898. [Accessed: 05-Aug-2016].
[21] “prec-div, Qprec-div | Intel® Software,” Intel Fortran Compiler Documentation, 2016. [Online]. Available: https://software.intel.com/en-us/node/522989. [Accessed: 05-Aug-2016].
[22] “fp-model, fp | Intel® Software,” Intel Fortran Compiler Documentation, 2016. [Online]. Available: https://software.intel.com/en-us/node/522979. [Accessed: 05-Aug-2016].
[23] “ipo, Qipo | Intel® Software,” Intel Fortran Compiler Documentation. [Online]. Available: https://software.intel.com/en-us/node/522852.
[24] “ftz, Qftz | Intel® Software,” Intel Fortran Compiler Documentation. [Online]. Available: https://software.intel.com/en-us/node/522985.
[25] “xHost, QxHost | Intel® Software,” Intel Fortran Compiler Documentation. [Online]. Available: https://software.intel.com/en-us/node/522846.
[26] “x, Qx | Intel® Software,” Intel Fortran Compiler Documentation. [Online]. Available: https://software.intel.com/en-us/node/522845.
[27] C. S. (Intel), “x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ),” 2008. [Online]. Available: https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz. [Accessed: 08-Oct-2016].
[28] “Basic Hotspots Analysis | Intel® Software,” Intel VTune Amplifier XE 2016 and Intel VTune Amplifier 2016 for Systems Help, 2016. [Online]. Available: https://software.intel.com/en-us/node/544020. [Accessed: 07-Aug-2016].
