Jornada de Seguimiento de Proyectos, 2010. Programa Nacional de Tecnologías Informáticas

Hardware and Software Support for High Performance Computing TIN2007-67537-C03

Javier Díaz Bruguera∗ (Universidad de Santiago de Compostela) and Ramón Doallo Biempica† (Universidad de A Coruña)

Abstract

The objectives of this project are a continuation of the results obtained during the development of the project TIN2004-07797-C02. Taking these results as our starting point, we have dealt with new objectives, some of them oriented to solving the new challenges that arose with the installation of the Finisterrae at CESGA (Galician Supercomputing Center) in 2007. The objectives are organized into three main areas: (1) Performance and programmability improvement of HPC systems, improving the functionality of HPC systems, with special focus on irregular codes, exploring two approaches: analytical modeling and runtime solutions. (2) Software tools for HPC and Grid facilities, developing middleware for system management of the Finisterrae supercomputer and Grid environments. (3) Performance improvement for multimedia applications and general purpose processors, where we tackle the design of algorithms and architectures for video compression, real-time visualization and functional units of general purpose processors.

Keywords: High performance computing, constellation architecture, multicore and multithreaded processors, efficient software.

1 Objectives of the project

∗Email: [email protected]    †Email: [email protected]

It has to be pointed out that the project proposal comprised three subprojects and research groups: USC, UDC and CESGA; however, the project was finally approved with only two of the three subprojects, USC and UDC. This has affected some of the objectives. Thus, two groups have been involved in this proposal: the Computer Architecture Group at the University of Santiago de Compostela (USC Group) and the Computer Architecture Group at the University of A Coruña (UDC Group). As global background, the project is a continuation of the research lines developed by the USC and UDC groups in recent years on high performance computing, both on the hardware side and on the software side. The objectives are organized into three main areas:

1. Improvement of the performance and programmability of HPC systems. The main concern of this part of the project is to improve the functionality of HPC systems, with special focus on irregular codes. We organized the proposal into two main topics: the characterization of irregular codes, combining compiler and run-time techniques, and the study of the functionalities of PGAS languages as an efficient alternative for constellation architectures (like Finisterrae).

(a) Compiler and Run-time support for performance analysis and optimization of irregular codes. We explore two approaches to deal with their complexity: analytical modeling, and runtime solutions such as inspector/executor. Both approaches require compiler support for analyzing these complex codes. This support will be provided by XARK (http://xark.des.udc.es), a compiler framework developed by the UDC Group [4].
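The inspector/executor strategy named above can be sketched in a few lines. This is an illustrative Python sketch of the general technique, not the project's actual XARK-based implementation: the inspector scans the indirection array at run time and partitions the iterations of an irregular reduction into blocks with disjoint write sets, which the executor could then run as independent parallel tasks.

```python
# Hedged sketch of the inspector/executor technique for an irregular
# reduction y[idx[i]] += f(x[i]); all names are illustrative.

def inspector(idx, n_blocks):
    """Partition iterations so that two iterations writing to the same
    element of y always land in the same block; blocks therefore have
    disjoint write sets and can run concurrently."""
    blocks = [[] for _ in range(n_blocks)]
    owner = {}  # target element -> block that writes it
    for i, target in enumerate(idx):
        b = owner.setdefault(target, i % n_blocks)
        blocks[b].append(i)
    return blocks

def executor(blocks, idx, x, y, f):
    """Apply the reduction block by block; in a real run each block
    would be one parallel task, with no cross-task synchronization."""
    for block in blocks:
        for i in block:
            y[idx[i]] += f(x[i])
    return y
```

The inspector cost is paid once and amortized over repeated executions of the loop, which is the usual argument for this technique on irregular codes.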

(b) Analysis and improvement of performance and programmability in HPC systems using PGAS approaches. The objectives of this line are: (1) to compare the programmability and performance using traditional approaches versus PGAS languages, (2) to propose performance optimizations and programmability enhancements by means of PGAS language extensions and/or libraries, and (3) to extend the usability and performance features of the HTA for the programming of hybrid systems such as constellation architectures.

2. Software tools for HPC and GRID facilities. The understanding and characterization of the performance of Grid applications and the accurate simulation of Grid systems are the focus of this research line.

(a) Management of large-scale HPC facilities. The research lines proposed in this new project take advantage of the results achieved in the previous MEC project. Specifically, the first goal is to use AdCIM [27] to develop customized and integrated system administration applications for the Finisterrae constellation architecture and for the Grid. One of the objectives related to the application of AdCIM, "Application of the AdCIM framework for the systematic development of customized tools for selected administration domains of CESGA", could not be carried out, since the CESGA subproject was not funded and the staff and infrastructure of CESGA were essential for the achievement of this objective.

(b) Fault tolerance of high-performance applications. In the frame of the previous MEC project a tool named CPPC (Controller/Precompiler for Portable Checkpointing) was developed [23]. In this project we continue the development of the CPPC tool to achieve the complete automation of the checkpointing process, so that the tool performs a source-to-source transformation of an MPI code into a fault-tolerant one by inserting the necessary functions of the library at safe points.
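The kind of code such a checkpointing transformation produces can be sketched as follows. This is a minimal Python illustration of the general save/restart idea (dump the live variables at a safe point, reload them and skip completed work after a failure); the function names are invented for the example and are not the CPPC library API.

```python
# Minimal sketch of what a checkpointing transformation inserts:
# at a "safe point" the live variables are dumped to disk; on restart
# the code reloads them and resumes from the saved iteration.
import os
import pickle

CKPT = "demo.ckpt"

def checkpoint(state):
    # write to a temporary file first, then rename, so a crash during
    # the dump cannot corrupt the previous checkpoint
    with open(CKPT + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(CKPT + ".tmp", CKPT)

def restart():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "acc": 0}  # initial live state

def long_computation(n, fail_at=None):
    st = restart()
    for i in range(st["i"], n):
        st["acc"] += i            # the application's real work
        st["i"] = i + 1
        checkpoint(st)            # safe point: no messages in transit
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
    return st["acc"]
```

In the MPI setting the extra difficulty, addressed by the analyses described in Section 2, is choosing safe points where no in-transit or inconsistent communications exist.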

(c) Software support for performance characterization and optimization of Grid applications. We propose to study the optimization of Grid resources to execute computationally intensive applications efficiently. Thus, we consider the Grid as a huge computing system for executing very time-consuming applications. Within this topic, the following objectives are considered: the understanding and characterization of the performance of these applications in Grids, the accurate simulation of Grid systems to reproduce real executions, and the improvement of the performance of the execution of these HPC applications.

3. Hardware for multimedia and general purpose processors. We tackle the implementation of video compression algorithms on programmable processors, exploiting the parallelism provided by the organization of the processors. On the other hand, we focus on the design of improved algorithms and architectures for the computation of essential operations for multimedia and other applications.

(a) Algorithms and architectures for multimedia. We focus on processors with EPIC and VLIW architectures. These processors provide instruction level parallelism and require efficient programming methodologies to exploit the SIMD programming paradigm, multithreading, software pipelining and the memory hierarchy. We will implement the video compression algorithms on those architectures. A special effort is devoted to the development of new algorithms for motion estimation. Moreover, we have addressed the design of units for real-time visualization in various applications.

(b) Design of functional units for general-purpose processors. Our goal in this project is to further improve the implementation of essential operations, such as square root and inverse square root, by developing multiplicative algorithms with reduced latency. Computations related to multimedia are error sensitive: small errors can propagate and result in large final errors. We therefore propose the use of error estimates that can help to obtain more reliable results. Another topic addressed in the project is the design of decimal floating-point hardware.
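The multiplicative algorithms mentioned in (b) can be illustrated with the classical Newton-Raphson iteration for the inverse square root, x' = x(3 - a·x²)/2, which uses only multiplications and additions and converges quadratically (each step roughly doubles the number of correct bits). The sketch below is a generic textbook illustration, not the reduced-latency algorithm developed in the project; the seed choice is crude, where a hardware unit would use a small lookup table.

```python
# Newton-Raphson inverse square root: multiply/add steps only.
import math

def inv_sqrt(a, steps=8):
    assert a > 0
    # Normalize a = m * 4**e with m in [1, 4), so that
    # 1/sqrt(a) = 2**(-e) / sqrt(m); this mirrors the exponent
    # handling of a floating-point unit.
    e = 0
    m = a
    while m >= 4.0:
        m /= 4.0
        e += 1
    while m < 1.0:
        m *= 4.0
        e -= 1
    x = 1.0 / m                           # crude seed in place of a table
    for _ in range(steps):
        x = x * (1.5 - 0.5 * m * x * x)   # one multiplicative NR step
    return x * 2.0 ** (-e)

def sqrt_via_inv(a):
    # sqrt(a) = a * (1/sqrt(a)): square root from the same iteration
    return a * inv_sqrt(a)
```

The rounding problem the project attacks appears exactly here: turning the final approximation into a correctly rounded result classically requires computing a remainder, which the proposed variants avoid in most cases.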

2 Level of success achieved in the project

It has to be pointed out that the project proposal comprised three subprojects and research groups: USC, UDC and CESGA; however, the project was finally approved with only two of the three subprojects, USC and UDC. This might have affected the level of success of some of the objectives, but most of them have been addressed. The two groups finally participating in the project have had a strong research collaboration for many years; in fact, several topics are being developed by teams composed of members of both groups. In any case, it is clear that the research interests of both groups are complementary. In order to show how the objectives of the project are being dealt with, we indicate for every objective in the previous section the level of success and the group involved in its development. Note that in the technical report of the project the objectives were decomposed into a set of more detailed subobjectives or tasks, which for space reasons are not listed here; we strongly recommend consulting those tasks in the technical report. For every topic a few representative publications are included, although there are other publications not referenced here. The most outstanding results are:

1. Improvement of the performance and programmability of HPC systems

(a) Compiler and Run-time support for performance analysis and optimization of irregular codes (UDC, USC). This objective had several tasks: (1) extension of XARK, (2) analysis and optimization of complex memory hierarchies, and (3) run-time characterization and performance improvement of parallel irregular codes. The XARK compiler framework (http://xark.des.udc.es) developed by the Computer Architecture Group of the UDC has been extended. In this part of the project two main research lines have been conducted.
The main contribution of the first research line is the formalization of a recognition algorithm that makes it possible to build a hierarchical representation of a program using the concept of computational kernel. This hierarchical representation provides an optimizing compiler with relevant information for improving the performance of a program on general-purpose parallel architectures (e.g., multi-core processors) and on specific-purpose parallel architectures such as GPUs. The internals of XARK consist of two demand-driven algorithms that analyze the Gated Single Assignment (GSA) form. This approach was shown to be a generic, robust and extensible solution for the automatic recognition of computational kernels. Another important contribution of the first research line is the definition of a collection of computational kernels that are representative of regular and irregular real applications. Finally, note that the internals of the XARK compiler framework have been evaluated with a set of well-known benchmark suites from different application domains, more specifically, Sparskit-II, Perfect Club, SPEC CPU2000 and the PLTMG code. The second research line is the construction of an interprocedural GSA (IGSA) form. This IGSA form will be the basis to extend XARK for the recognition of complex syntactic variants of computational kernels that have been implemented using function/procedure calls. At this moment, a simple and fast algorithm to build IGSA on top of the Static Single Assignment (SSA) form available in modern optimizing compilers has been developed [3]. IGSA hinges on the concept of locator as an abstraction of contiguous and non-contiguous memory regions represented as data structures that use arrays and/or pointer variables. Work in progress focuses on the evaluation of the IGSA construction algorithm using an implementation based on the GIMPLE-SSA intermediate representation of the GNU GCC compiler.
The algorithm is evaluated in terms of memory consumption and execution time using C and Fortran codes from well-known benchmark suites. Finally, a new research line that is a natural evolution of the objectives of this part of the project has been started: its objective is to propose a new intermediate representation for parallelizing compilers based on the concept of computational kernel of the XARK compiler framework [4]. With respect to the analysis and optimization of complex memory hierarchies, we completed the modeling of the Discrete Fourier Transform (DFT) using Probabilistic Miss Equations (PMEs) [10]. Namely, we focused on the DFTs generated by the SPIRAL tool from Carnegie Mellon University. These codes present very complex access patterns, and the accurate modeling of their performance required extending the PME model to consider physically indexed caches as well as hardware prefetching. The results were very successful: our model can not only find DFTs of a quality similar to those found by SPIRAL's own search while requiring much shorter search times, but it can sometimes even find faster codes. Moreover, we extended the Probabilistic Miss Equations (PME) model to estimate the Worst Case Memory Performance (WCMP) of codes with regular access patterns in the absence of information on the base addresses of the data structures [1]. This is a very interesting feature, since such addresses are not available in many situations. Our first approach was not completely safe, failing to provide the WCMP in some situations; in the last year we developed a totally safe prediction. A model to predict the Best Case Memory Performance (BCMP), and a model to predict the WCMP of codes with irregular access patterns, were also developed. Finally, we designed a cache called Set Balancing Cache (SBC) that is able to shift lines from sets whose working sets cannot fit in them to sets with more lines than their working set [26].
Comparisons with related work proved the efficiency of the SBC design in terms of performance, area and power consumption.

(b) Analysis and improvement of performance in HPC (USC). A framework based on a methodology to obtain analytical models of MPI applications on multiprocessor environments has been developed [15]. The framework consists of an instrumentation stage followed by an analysis stage, based on the CALL instrumentation tool and on the R language, respectively. A number of functionalities were developed in the analysis stage to help the performance analyst to study analytical models in parallel environments. One of the most relevant features is an automatic fit process to obtain an analytical model of parallel programs. An efficient and flexible process to automatically generate and fit all candidate analytical functions was developed. The process allows the user to introduce information about the behavior of the code, so that the precision of the obtained model depends on the amount of information provided. If no information is introduced, the monitored parameters are used to build the initial functions. This process provides a full search over all possible candidate analytical functions with physical meaning. A methodology to automatically obtain a complete characterization of the communication behavior in MPI environments was also developed [16]. This method automatically detects the message sizes where the communication behavior changes; thereby, the range of message sizes can be split into different intervals. As the communication behavior inside each interval presents a linear dependency on the message size, each interval can be characterized by its own LogP-based parameter set. Both the parameter assessment and the detection of different communication behaviors are obtained using microbenchmark measurements.
The detection process splits the message size range into contiguous intervals with different behaviors of overhead and gap. The procedure finds significant changes in both features, and groups measurements based on the similarity of the parameter estimations using data mining techniques. The parameters of some LogP-based models are assessed for each message size interval. Practical results in real systems show the accuracy of the method, as it can automatically detect architectural characteristics that influence the performance of communications in MPI environments. In addition, a method for obtaining statistical models to characterize the performance of parallel codes has also been proposed. This method, based on model selection techniques using Akaike's information criterion (AIC), provides the user with a statistical model of the code under study, as well as statistical information that can be used to validate the proposed model. The proposed methodology generates a set of candidate models based on information that can be provided by the user; this information contains a number of metrics and variables that might influence the performance of the application. Then, a model selection technique based on AIC is used to select the most accurate model. Furthermore, the Akaike weights and the relative importance of the terms are also calculated. This information helps the user to evaluate the appropriateness of the proposed best model, and it provides a guide on how to improve the analysis process. This method has been implemented in the modeling framework, and validated in different situations, such as different implementations of collective operations in Open MPI and NPB benchmarks with different communication-to-computation ratios. Another research line focused on a study to characterize Finisterrae, a several-node NUMA system comprising two SMP cells per node and four dual-core Itanium2 Montvale processors per cell.
The main objective was to determine the performance effect of bus contention and cache coherency, as well as the suitability of porting strategies for irregular codes in such a complex architecture. Results show that, for big data sizes, the effect of sharing a bus degrades the final performance but masks the cache coherency effects [21]. Furthermore, these outcomes allow us to study thread-to-core mappings and memory allocation policies. We are currently working on the development of strategies to guide applications at runtime, which will be especially relevant and will facilitate the task of a hypothetical scheduler when defining a policy to map threads to cores and to allocate memory in a cell [20]. A different research line focused on the idea of effectively using on-chip hardware counters to improve the performance of memory accesses at runtime [19]. We have selected two different contexts in which the benefits of this idea can be important: the efficient execution of irregular codes, and the use of page migration to improve the execution of parallel codes. Previous developments in terms of IRAD and distance metrics were adapted to this new context.

(c) Programmability in HPC systems using PGAS approaches (UDC). Regarding the comparison of programmability and performance of traditional parallel programming paradigms and PGAS approaches, the performance of UPC (both HP UPC and Berkeley UPC) on the Finisterrae supercomputer was compared to that of more traditional approaches (MPI and OpenMP), using a series of representative kernel and application benchmarks. The performance results were analyzed, detecting bottlenecks and inefficiencies in current UPC compilers and libraries. As a result, scalable performance is usually obtained through costly manual optimizations, making code development with UPC harder.
Moreover, the UPC compiler technology is generally immature, so UPC codes usually show poorer performance than their counterpart C codes. We defined a methodology to characterize the UPC learning process. The application of this new approach in several training sessions for novice UPC programmers has shown that UPC is much easier to learn than the message-passing paradigm, although the performance obtained is usually much poorer. The results of this task strongly support the development of the following tasks, particularly the optimization of UPC libraries [11] and the improvement of the programmability of collective operations, removing some restrictions that limit the general adoption of UPC. With respect to the development of PGAS language extensions and new libraries to enhance programmability and performance of HPC systems, focusing on irregular codes, we tackled the optimization of UPC standard collectives on multicore cluster configurations, taking advantage of inter/intra-node awareness as well as data locality. UPC performance can be improved by maximizing the locality of the data, reducing remote accesses, and through an efficient mapping of threads to particular cores. Moreover, we designed and developed a parallel numerical library for UPC (BLAS1, BLAS2 and BLAS3 dense operations). In order to assist UPC performance improvement, Servet, a suite of benchmarks focused on detecting a set of parameters of multicore systems (cache hierarchy, topology and sizes, as well as memory and network throughput), has been implemented [12]. The detected parameters have a high influence on overall performance, and exploiting them significantly improves the scalability of UPC codes. Finally, we have designed and developed libraries that improve UPC programmability, supporting irregular collective operations as well as computational kernels (e.g., HashMap, Montecarlo).
Additionally, a set of collective operations that do not require the use of shared variables is provided (the avoidance of this restriction is important for both programmability and performance enhancement). On the other hand, a version of our Hierarchically Tiled Arrays (HTA) library was developed based on Threading Building Blocks (TBB), an API native to multicore systems that enables the exploitation of their advantages [?]. We then compared task-oriented parallel programming using TBBs versus the data-parallel approach on which HTAs rely in shared memory systems. We concluded that the performance was similar, while HTAs improved programmer productivity. Usability enhancements to the HTA class, such as HTA dynamic repartitioning and overlapped tiling, were also developed. We also examined the design issues, opportunities and challenges that the migration of HTAs from distributed to shared memory environments brings. Finally, we began the exploration of the issues related to the application of HTAs to problems that are typically expressed with other parallelization paradigms, as well as with data structures different from the matrices that HTAs support. Expressing these new kinds of problems using data parallelism involves, in our opinion, extending data parallel programming with new abstractions.

2. Software tools for HPC and GRID facilities

(a) Management of large-scale HPC facilities (UDC). The AdCIM framework (adcim.des.udc.es), a result of the previous TIN2004-07797-C02 project, is a model-driven framework, based on the CIM model, for the management of large-scale and heterogeneous systems. AdCIM uses a more efficient XML representation of the CIM model (called miniCIM) to represent management and configuration data extracted from managed machines.
These data are extracted from a large number of configuration sources, such as flat text configuration files, by means of custom text-to-XML parsing, and persisted transparently and scalably into an LDAP repository. AdCIM consolidates and integrates these data using the expressive power of the CIM model and its relation model, exposing them through a web service interface both to external applications and to web forms generated on the fly from CIM definitions using the XForms technology. This project had several objectives related to the application of AdCIM; one of them, "Application of the AdCIM framework for the systematic development of customized tools for selected administration domains of CESGA supercomputers", could not be carried out, since the CESGA subproject was not funded and the staff and infrastructure of CESGA were essential for the achievement of this objective. On the other hand, AdCIM has been extended to Grid environments, specifically the Monitoring and Discovery System (MDS) of the Grid middleware Globus Toolkit. The Index Service of MDS collects data from various grid-managed sources and provides a query/subscription interface to those data. This interface connects the Index Service to existing services and monitoring agents. The AdCIM framework was successfully integrated into MDS [5], making it possible to interrelate and represent new types of data in MDS's Index Service and, for grid applications, to access these data via MDS. Next, using the new Globus UsefulRP subsystem, we developed a new CIM-based information provider which communicates with a new database backend for data storage, replacing the default Globus memory-backed backend. This new infrastructure greatly improved the scalability of the use of CIM data on Globus. Additionally, and based on the collaboration with the LORIA-INRIA MADYNES group in France, a new AdCIM extension was developed to manage Wireless Mesh Networks [8].
This extension was used to provide AdCIM with new ontological reasoning processes which apply knowledge to diagnose and troubleshoot various configuration problems.

(b) Fault tolerance of high-performance applications (UDC). CPPC (ComPiler for Portable Checkpointing, http://cppc.des.udc.es) is a checkpointing tool focused on the insertion of fault tolerance into long-running message-passing applications [25]. It was initially developed in the context of the previous MEC project. During this new project we have continued the development of the CPPC tool to make the insertion of fault tolerance into message-passing codes totally transparent to the user [24]. Three analyses were incorporated into the compiler: (1) a data-flow analysis to determine the variables to be saved in the checkpoint file, (2) a message-flow analysis to determine the safe points, that is, regions in the code where neither in-transit nor inconsistent communications exist, and (3) a heuristic analysis of code complexity to select the best safe points at which to dump the checkpoint file. CPPC was extensively evaluated using a large number of very different applications on the Finisterrae supercomputer hosted by CESGA (Supercomputing Center of Galicia). The tests included benchmarks and real applications, both sequential and parallel codes. A statistical approach was used to better estimate the performance of the tool. A service-based architecture called CPPC-G was developed. CPPC-G provides services for the automatic handling of checkpoint files generated by CPPC-enabled applications executed on the Grid [7]. The new Grid services automatically ask for the necessary resources; start and monitor the execution; create backup copies of the checkpoint files; detect failed executions; and restart failed applications automatically. An experimental evaluation of the framework was performed, measuring the impact of the CPPC-G services on an application execution.
Experimental results show that the overhead introduced by CPPC-G is small, especially when compared to the typical execution times of long-running applications.

(c) Software support for performance characterization and optimization of Grid applications (USC). An extension of the GridSim toolkit to simulate parallel applications was developed. This extension consists of a set of modular components, each of which can be extended to include new capabilities. A job model was implemented to support different kinds of applications. Also, support for modelling custom internal networks was developed, and three of the most common network behaviors were implemented. Furthermore, different local parallel schedulers were implemented in such a way that they can be easily extended. Besides, due to the importance of resource failures for parallel applications on Grids, support for checkpointing and failure simulation was also added to the simulator. This toolkit establishes the basis for a systematic analysis of parallel applications in a Grid; in particular, it could be used to analyze the performance of different metaschedulers. Metaschedulers based on analyses over this toolkit are currently being studied. Finally, this tool is useful to engineer Grid infrastructures for parallel applications and to analyze the effects of different design decisions. Also, it can help to design parallel Grid applications, because their behavior can be simulated and tuned to achieve better performance. In addition, new metrics to validate the efficiency of schedulers on heterogeneous systems have been proposed. These metrics can be considered generalizations of classical ones to represent execution times and throughput. They are especially useful for heterogeneous systems, and in particular for grids. They quantify how far the scheduler's decision is from assigning each job to its best resource, which implies the need to know the execution time of each job on the best resource for it.
Usually this can only be known through simulations; however, a study based on runtime estimations for real jobs has also been performed. The features of these metrics were analyzed and compared with the traditional metrics from theoretical and practical points of view. Apart from that, the WRF model, a simulation program used both for operational forecasting and for research, was migrated to a grid environment of the e-Ciencia network. Although such models are often used on supercomputing platforms, this approach offers two main advantages: transparent access to other supercomputing centers, and deployment on more grid computing resources, which is especially relevant for the statistical analysis of weather forecasts, where the execution of hundreds or thousands of cases with small variations in the initial conditions is necessary. Finally, two different works were also performed: on the one hand, the integration of all available resources in the computer labs of our University using grid technology, thus facilitating their reuse by researchers from the universities of Galicia to solve specific problems; on the other hand, the analysis of the technical feasibility of, and a business model for, the use of external resources located in centers with high computing capacity for render-based applications.

3. Hardware for multimedia and general purpose processors

(a) Algorithms and architectures for multimedia (USC, UDC). A number of significant advances have been achieved in the field of the H.264 video standard. We have obtained the most interesting results in arithmetic coding, extending previous results to the decoder. Parallel decoding has been enhanced by speeding up arithmetic decoding using a new architecture. Efficient implementations of transform coding have also been developed, balancing the cost and the processing speed required to interface with other tasks.
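For reference, the core of H.264 transform coding is the 4x4 integer transform Y = C·X·Cᵀ, a DCT approximation whose coefficients are only ±1 and ±2 and can therefore be computed with shifts and adds. The sketch below shows it as plain matrix products for clarity; it is a generic illustration of the standard transform, not the optimized EPIC/VLIW implementation discussed above.

```python
# The H.264 4x4 forward integer transform as plain matrix products;
# in hardware every multiply is a shift or an add.

CF = [[1,  1,  1,  1],
      [2,  1, -1, -2],
      [1, -1, -1,  1],
      [1, -2,  2, -1]]

def matmul4(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose4(A):
    return [list(r) for r in zip(*A)]

def forward_transform(X):
    """Y = CF . X . CF^T for a 4x4 residual block X of integers."""
    return matmul4(matmul4(CF, X), transpose4(CF))
```

A quick sanity check of the transform: a constant block must produce energy only in the DC coefficient, which is the property that makes it useful for compression.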
We have implemented several image coding and processing algorithms on a massively parallel processor array (MPPA) by Ambric (www.ambric.com). Outstanding performance has been achieved by exploiting both data and functional parallelism across hundreds of processors. These results clearly outperform previous ones obtained on VLIW architectures; therefore, we are currently more interested in orienting our research towards MPPAs. Our first implementation of an ASIP for video and image applications covers entropy coding and decoding [17]. An architecture has been proposed that achieves better results than any other programmable architecture we are aware of. For motion estimation and other tasks, we are analyzing the possibilities of MPPAs and studying how to make a breakthrough by introducing new instructions [9].

(b) Design of functional units for general-purpose processors (USC). In binary floating-point arithmetic, we have focused on improving the rounding stage of multiplicative algorithms for division, square root and their reciprocals. We have developed several variations of the traditional rounding algorithm that avoid the calculation of the remainder in a large number of cases (see for example [22]), outperforming previous solutions in the literature. On the other hand, we have been working on error estimates for floating-point computations. It is well known that floating-point computations are error prone due to the accumulation of rounding errors. We have developed error estimates that allow an estimation of the accuracy of the computation [14]. Regarding the implementation of binary transcendental functions, we developed a new CORDIC algorithm with low latency for all the operating modes [2]. The resulting architecture can trade off latency for power by lowering the voltage, allowing a significant reduction of power consumption with the same latency as the previous fastest proposals. As part of this work, we developed VHDL synthesizable code for the unit.
This code is in the verification phase, in order to develop an IP core meeting the quality standards required by industry. In the topic of adder design, we have developed a framework that allows the specification of a broad family of adders and that will be the basis for a CAD tool for adder design. As part of this framework, we developed a simple method to extend any prefix adder with the Ling enhancement to reduce latency [28].
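The prefix-adder family covered by such a framework shares one algebraic core: the associative (generate, propagate) prefix operator. As a behavioral illustration only (not the authors' framework, and without the Ling enhancement), a Kogge-Stone prefix adder can be modeled in software:

```python
def kogge_stone_add(a, b, width=16):
    """Behavioral model of a Kogge-Stone parallel-prefix adder (mod 2^width)."""
    # Per-bit generate and propagate signals.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    G, P = g[:], p[:]
    # log2(width) prefix levels; at each level, position i combines with
    # position i - d via (g1, p1) o (g2, p2) = (g1 | (p1 & g2), p1 & p2).
    # Iterating i downwards keeps the previous level's values intact.
    d = 1
    while d < width:
        for i in range(width - 1, d - 1, -1):
            G[i] = G[i] | (P[i] & G[i - d])
            P[i] = P[i] & P[i - d]
        d *= 2
    # The carry into bit i is the group generate of bits i-1..0.
    carries = [0] + G[:-1]
    s = 0
    for i in range(width):
        s |= (p[i] ^ carries[i]) << i
    return s
```

The same prefix operator, wired with different fan-out and depth trade-offs, yields the other adders of the family (Sklansky, Brent-Kung, Han-Carlson), which is what makes a single specification framework possible.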

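For reference, the classical rotation-mode CORDIC recurrence that such designs build on rotates a vector by decreasing elementary angles atan(2^-i), steering the residual angle toward zero. The plain software model below is an illustration of the baseline recurrence, not the low-latency algorithm of [2], which restructures these iterations; it converges for |theta| up to roughly 1.74 rad.

```python
import math

def cordic_sincos(theta, iterations=32):
    """Rotation-mode CORDIC returning (cos(theta), sin(theta))."""
    # Elementary angles atan(2^-i) and the aggregate scale factor K,
    # both precomputed constants in a hardware implementation.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    k = 1.0
    for i in range(iterations):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    # Starting at x = 1/K pre-compensates the rotation gain.
    x, y, z = k, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        x, y, z = (x - d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * angles[i])
    return x, y
```

Each iteration uses only shifts and adds, which is why CORDIC remains attractive for hardware transcendental units; the latency cost of the long iteration chain is exactly what low-latency variants attack.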
Regarding the design of decimal floating-point units, we have developed several algorithms and architectures for addition, multiplication, division and transcendental functions, improving on previous implementations. Of special interest is a new decimal multiplier that allows a significant reduction in both latency and area compared to previous designs. This research attracted the attention of IBM and led to a 13-month joint project to transfer the technology to the IBM R&D Center in Böblingen (Germany). IBM is expected to incorporate the developed architectures in future generations of its Power and Z series processors [29].

Moreover, we have been working on the design of algorithms and units to achieve real-time visualization in various applications. Bézier surfaces are among the most useful primitives for high-quality modeling in CAD/CAM tools and graphics software. Traditionally, Bézier surfaces are tessellated on the CPU and the resulting set of triangles is sent to the GPU, so the CPU-GPU bus can become a bottleneck. We have made two proposals for synthesizing Bézier models directly on the GPU, one based on the use of a set of virtual vertices in the vertex shader and the other on the efficient exploitation of the geometry shader capabilities. On the other hand, we have developed a new scheme to join models for hybrid terrain representation, combining data with different topologies [6]. Besides, an architecture based on a local convexification algorithm for the hybrid representation of terrain was designed and implemented on a Virtex-II FPGA. We have also proposed a method based on the application of an enriched hierarchical radiosity algorithm that provides high-quality illumination for input scenes with low-resolution objects [18].
Our method produces high-quality images with a significant reduction in computational cost, and we have developed approaches to deal with its high memory requirements. Besides, we have implemented the Monte Carlo radiosity algorithm on a GPU using CUDA. Finally, we have implemented a shallow water simulation on a GPU using Brook+, based on computational kernel recognition techniques.
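GPU tessellation of Bézier surfaces ultimately reduces to evaluating patch points at sampled (u, v) parameter values; for a tensor-product patch this is repeated linear interpolation (de Casteljau's algorithm). The host-side sketch below shows that evaluation step for a 4x4 bicubic control grid; it is illustrative only, not the proposed vertex- or geometry-shader implementations.

```python
def de_casteljau(points, t):
    # Repeated linear interpolation at parameter t collapses the control
    # points of a Bezier curve down to the single point on the curve.
    pts = list(points)
    while len(pts) > 1:
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]

def bezier_patch_point(control_grid, u, v):
    # Tensor-product surface evaluation: evaluate each row of the 4x4
    # control grid at u, then evaluate the resulting column curve at v.
    row_points = [de_casteljau(row, u) for row in control_grid]
    return de_casteljau(row_points, v)
```

A tessellator simply runs this evaluation over a grid of (u, v) samples and connects the resulting points into triangles; doing it on the GPU avoids shipping those triangles across the CPU-GPU bus.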

3 Achievement indicators

As stated in the previous section, the USC group has participated in objectives 1(a), 1(b), 2(c), 3(a) and 3(b), and the UDC group has participated in objectives 1(a), 1(b), 2(a), 2(b) and 3(a). The main results obtained as part of the project have been summarized in Section 2. It has to be pointed out that the level of achievement of the objectives included in the project proposal is very high. All the objectives have been addressed during the execution of the project and interesting results have been obtained. Note that both groups had extensive experience in the topics of the project and the objectives were a natural evolution of the research being developed in the groups, which obviously facilitated achieving all of them. The achievement indicators are summarized in this section. Note that, for space reasons, the complete list of publications and PhD theses is not included; only the most representative ones are referenced.

3.1 Subproject 1. USC group

Several of the indicators summarized here can be found on the web site of the group (www.ac.usc.es).

• Level of achievement of the objectives. Very high. As shown in the previous section, the objectives tackled by the USC group have been achieved.

• Relevance and originality of the results. The relevance and originality of the results have been outlined in the previous section.

• Publications. During the development of the project, 2007-2010, we have published 24 papers in journals and 70 conference papers, and we have edited 2 books.

• Usefulness of the results and relationship with the social environment. Our experience in performance analysis is going to be used in a new project, funded by the Ministry of Industry of Spain (Avanza program) and titled A new remote render, in collaboration with several local companies in the multimedia area. On the other hand, we are participating in a research project, funded by HP Labs, in collaboration with CESGA and the Computer Architecture Group at the University of A Coruña; the objective of this project is to adapt previous results obtained in our group to the HP NUMA systems. Finally, our experience in multimedia, particularly in video compression and computer graphics, has allowed us to participate in the development of the Center for Experimentation and Production of Digital Contents of the University of Santiago de Compostela, funded by the Ministry of Industry (Avanza program), in collaboration with several groups of the University of Santiago de Compostela and local companies.

• Training capacity of the group. The CAG-USC offers, as part of the Dep. of Electronics and Computer Science of the University of Santiago de Compostela and in conjunction with the UDC group, the PhD Program Interuniversity Program in Information Technologies, which includes a specialization in HPC technologies. The Program has been awarded by the Ministry of Education (since its creation in 2004) the Quality Mention (Mención de Calidad, ref. MCD2004-00378) for its academic excellence, which has provided additional MEC funds for invited professors and student mobility. This Program will be replaced in Sep. 2010 by the Master in High Performance Computing (currently in the verification process), which will be fully focused on HPC and will use the supercomputing infrastructure provided by the Galicia Supercomputing Center (CESGA). There were five PhD students included in the project proposal. Three of those students, Natalia Seoane, Manuel Aldegunde and Daniel Piso, have presented their PhD dissertations during the development of the project, and the other two are finishing theirs and will present them soon. Moreover, as part of the funding provided for the development of the project, three new students have been incorporated into the research team, one through a research contract and two through grants. These students are still working on their PhDs, with presentations expected in one or two years. Furthermore, during the development of the project the group has incorporated other PhD students working in related areas, funded by other research projects and contracts, and some of their PhD dissertations have been finished as well. In summary, since the starting date of the project six PhD dissertations have been presented and currently there are 11 PhD students working on topics related to the project.

• Collaboration with other groups. Our group maintains collaborations with several groups in Spain and abroad: the Institute of Computer Science at the Academy of Sciences of the Czech Republic, the University of California at Irvine, the Politecnico di Torino, the University of Málaga, HP Labs, IBM and the Centro de Supercomputación de Galicia (CESGA). We are participating with several local companies in a research project funded by the Ministry of Industry of Spain (Avanza program), and we participate in several research networks, for example HiPEAC (funded by the EU), CAPAP-H (Ministry of Education of Spain), the Spanish network for e-science (Ministry of Education of Spain), Middleware Grid (Ministry of Education of Spain), the GIS network (Office of R&D of Galicia), Mathematica Consulting and Computing (Office of R&D of Galicia), the Galician Network on High Performance Computing (GHPC, Office of R&D of Galicia) and the Galician Network on Parallel and Grid Technologies (RedeGRID, Office of R&D of Galicia).

3.2 Subproject 2. UDC group

• Level of achievement of the objectives. Very high. As shown in the previous section, the objectives tackled by the UDC group have been achieved.

• Relevance and originality of the results. The relevance and originality of the results have been outlined in the previous section.

• Publications. During the development of the project, 2007-2010, we have published 16 papers in international journals, 62 international conference papers, and edited 4 special issues of international journals.

• Usefulness of the results and relationship with the social environment. The research line on programmability of HPC systems using PGAS approaches has led to a contract funded by Hewlett-Packard, currently under development with the participation of the GAC-USC and CESGA. Also, the UPC community has shown significant interest in the UPC collectives microbenchmark suite, which has motivated its upcoming public release in collaboration with HP Labs at Marlboro (USA). Regarding the research line on fault tolerance of high-performance applications, CPPC performance has been evaluated on a public supercomputing infrastructure hosted at CESGA, using real applications in production at the center. There is increasing interest from CESGA in fault tolerance tools in order to maximize the profit from the available resources. CPPC is the only publicly available portable checkpointer for message-passing applications; it is an open-source project, available at http://cppc.des.udc.es under the GPL license. Finally, the improvements in Java communications have also attracted the interest of companies (in particular in the electronic trading sector, Comunytek Consultores) in the outcomes of our research.

• Training capacity of the group. The CAG-UDC offers, in conjunction with the Dep. of Electronics and Computer Science of the University of Santiago de Compostela, the PhD Program Interuniversity Program in Information Technologies, which includes a specialization in HPC technologies. The Program has been awarded by the Ministry of Education (since its creation in 2004) the Quality Mention (Mención de Calidad, ref. MCD2004-00378) for its academic excellence, which has provided additional MEC funds for invited professors, student mobility, and funding of members of the examination committees of European PhDs (Mención de Doctor Europeo). This Program will be replaced in Sep. 2010 by the Master in High Performance Computing (currently in the verification process), which will be fully focused on HPC and will use the supercomputing infrastructure provided by the Galicia Supercomputing Center (CESGA). Three PhD theses have been developed during this project, authored by Gabriel Rodríguez, Guillermo López (European PhD) and Iván Díaz (to be presented in early 2010). Currently, five members of the group hold a predoctoral fellowship: 2 FPIs granted through this project (R. Concheiro and D. Rolán), 2 FPUs (J. González and J. Andión) obtained in the last call of 2008, and 1 predoctoral fellowship from the Galician Government (C. Teijeiro).

• Collaboration with other groups. The Computer Architecture Group (CAG-UDC) has established the following collaborations during this project: Polaris Group, Dep. of Computer Science, Univ. of Illinois at Urbana-Champaign (USA); IBM T.J. Watson Research Center, New York (USA); Hewlett-Packard Labs, Marlboro (USA); Spiral Group, Dep. of Electrical and Computer Engineering, Carnegie Mellon University (USA); Dep. of Computer Science, Univ. of Texas at Austin (USA); WSI/GRIS Group, Dep. of Graphical-Interactive Systems, University of Tübingen (Germany); Computer Graphics Systems Group, Hasso Plattner Institut, University of Potsdam (Germany); Computer Architecture Group, School of Computer Science, Chemnitz Technical University (Germany); Centre for Advanced Computing and Emerging Technologies (ACET), University of Reading (UK); Dep. of Computing, University of Portsmouth (UK); Compiler and Architecture Design Group, Dep. of Computer Science, University of Edinburgh (UK); Madynes Group, INRIA-LORIA (France); OASIS Group, INRIA at Sophia Antipolis (France); Graphics Group, Dep. of Computer Science, Lund University (Sweden); Institute of Computing Technology, Chinese Academy of Sciences (China); Alchemy Group, INRIA Futurs (France).
The CAG-UDC has also participated actively in several HPC-related research networks, 3 of them EU-funded: HiPEAC (High-Performance Embedded Architectures and Compilers Network of Excellence), HiPEAC-2, and ComplexHPC (Open European Network for High Performance Computing on Complex Environments). The CAG has also led the Galician Network on High Performance Computing (http://ghpc.udc.es), funded by the Galician Government with 180,000 Euros for 2007-2009, with the aim of promoting and coordinating HPC R&D initiatives and collaborations in Galicia. This network is composed of 12 interdisciplinary research groups.

3.3 Coordination, development of the project

As stated in previous sections, the project proposal comprised three subprojects and research groups: USC, UDC and CESGA; however, the project was finally approved with only two of the three subprojects, USC and UDC. To coordinate the two groups we have used the same policy that provided very satisfactory results in previous projects developed by both groups. Periodic meetings at several levels have been used to coordinate the research groups. The two principal researchers of the two groups have kept almost permanent contact to analyze the evolution of every topic from a global perspective. On the other hand, the researchers responsible for every topic met to determine the best way to deal with the objectives of the project. Finally, the researchers involved in every task met frequently.

Finally, we would like to point out that a six-month extension of the project has been approved recently, which will allow us to fully complete almost every objective of the project.

References

[1] D. Andrade, B.B. Fraguela, R. Doallo. Static Prediction of Worst-case Data Cache Performance in the Absence of Base Address Information. Proc. 15th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'09), pp. 45-54, April 2009.
[2] E. Antelo, J. Villalba and E.L. Zapata. A Low-Latency Pipelined 2D and 3D CORDIC Processors. IEEE Transactions on Computers, vol. 57, no. 3, pp. 404-417, March 2008.
[3] M. Arenaz, P. Amoedo, J. Touriño. Efficiently Building the Gated Single Assignment Form in Codes with Pointers in Modern Optimizing Compilers. Proc. 14th International Euro-Par Conference on Parallel Processing, Euro-Par 2008, pp. 360-369, 2008.
[4] M. Arenaz, J. Touriño, R. Doallo. XARK: An extensible framework for automatic recognition of computational kernels. ACM Trans. Program. Lang. Syst., 30(6):1-56, 2008.
[5] I. Díaz, G. Fernández, M.J. Martín, P. González, J. Touriño. Integrating the Common Information Model with MDS4. 9th IEEE/ACM International Conference on Grid Computing, Grid 2008, pp. 298-303, Tsukuba, Japan, September 2008.
[6] M. Bóo and M. Amor. Dynamic Hybrid Terrain Representation Based on Convexity Limits Identification. International Journal of Geographical Information Science, 23(4):417-439, 2009.
[7] D. Díaz, X.C. Pardo, M.J. Martín, P. González. Application-Level Fault-Tolerance Solutions for Grid Computing. 8th IEEE International Symposium on Cluster Computing and the Grid, CCGRID 2008, IEEE Computer Society, pp. 554-559, Lyon, France, May 2008.
[8] I. Díaz, C. Popi, O. Festor, J. Touriño, R. Doallo. Ontological Configuration Management for Wireless Mesh Routers. 9th International Workshop on IP Operations and Management, IPOM 2009, Lecture Notes in Computer Science, vol. 5843, pp. 116-129, Venice, Italy, October 2009.
[9] C. Díaz Resco, R.R. Osorio and J.D. Bruguera. High Performance Image Processing on a Processor. Proc. 12th Euromicro Conference on Digital System Design (DSD'2009), Patras (Greece), pp. 233-236, 2009.
[10] B.B. Fraguela, Y. Voronenko, M. Püschel. Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling. 18th Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT'09), pp. 271-280, September 2009.
[11] J. González-Domínguez, M.J. Martín, G.L. Taboada, J. Touriño, R. Doallo, A. Gómez. A Parallel Numerical Library for UPC. 15th International European Conference on Parallel and Distributed Computing, Euro-Par 2009, pp. 630-641, Delft, The Netherlands, 2009.
[12] J. González-Domínguez, G.L. Taboada, B.B. Fraguela, M.J. Martín, J. Touriño. Servet: A Benchmark Suite for Autotuning on Multicore Clusters. 24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010, Atlanta, Georgia, USA, 2010 (accepted).
[13] J. Guo, G. Bikshandi, B.B. Fraguela, D. Padua. Writing productive stencil codes with overlapped tiling. Concurrency and Computation: Practice and Experience, 21(1):25-39, January 2009.
[14] T. Lang and J.D. Bruguera. A Hardware Error Estimate for Floating Point Computations. Proc. SPIE Conference on Advanced Signal Processing Algorithms, Architectures and Implementations XVIII, San Diego, USA, pp. 70740N1-70740N11, 2008.
[15] D.R. Martínez, V. Blanco, T.F. Pena, J.C. Cabaleiro, F.F. Rivera. Performance Modeling of MPI Applications using Model Selection Techniques. PDP 2010 - The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, 2010 (accepted).
[16] D.R. Martínez, J.C. Cabaleiro, T.F. Pena, F.F. Rivera, V. Blanco. Accurate Analytical Performance Model of Communications in MPI Applications. 8th International Workshop on Performance Modeling, Evaluation, and Optimization of Ubiquitous Computing and Networked Systems (PMEO-UCNS'2009), 23rd IEEE International Parallel and Distributed Processing Symposium, Rome, 2009.
[17] R.R. Osorio and J.D. Bruguera. An FPGA Architecture for CABAC Decoding in Many-core Systems. IEEE 19th International Conference on Application-specific Systems, Architectures and Processors (ASAP 2008), Leuven (Belgium), pp. 293-298, 2008.
[18] E.J. Padrón, M. Amor, M. Bóo and R. Doallo. Hierarchical Radiosity for Multiresolution Systems Based on Normal Tests. Computer Journal (accepted).
[19] J.C. Pichel, D.B. Heras, J.C. Cabaleiro, F.F. Rivera. Increasing data reuse of sparse algebra codes on simultaneous multithreading architectures. Concurrency and Computation: Practice and Experience, 21(15):1838-1856, October 2009.
[20] J.C. Pichel, D.B. Heras, J.C. Cabaleiro, A.J. García-Loureiro, F.F. Rivera. Increasing the locality of iterative methods and its application to the simulation of semiconductor devices. International Journal of High Performance Computing, 2009 (accepted).
[21] J.C. Pichel, J.A. Lorenzo, D.B. Heras, J.C. Cabaleiro, T.F. Pena. Analyzing the Execution of Sparse Matrix-Vector Product on a SMP-NUMA System. Journal of Supercomputing, 2010 (accepted). J.A. Lorenzo, J.C. Pichel, D. LaFrance-Linden, F.F. Rivera, D.E. Singh. Lessons Learnt Porting Parallelisation Techniques for Irregular Codes to NUMA Systems. PDP 2010, 2010 (accepted).
[22] D. Piso and J.D. Bruguera. Variable Latency Goldschmidt Algorithm based on a New Rounding Method and Remainder Estimate. IEEE Transactions on Computers (submitted).
[23] G. Rodríguez, M.J. Martín, P. González and J. Touriño. Controller/Precompiler for portable checkpointing. IEICE Transactions on Information and Systems, E89-D(2), pp. 408-417, 2006.
[24] G. Rodríguez, M.J. Martín, P. González, J. Touriño. A heuristic approach for the automatic insertion of checkpoints in message-passing codes. Journal of Universal Computer Science, 15(14):2894-2911, 2009.
[25] G. Rodríguez, M.J. Martín, P. González, J. Touriño, R. Doallo. CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications. Concurrency and Computation: Practice & Experience (accepted). DOI: 10.1002/cpe.1541.
[26] D. Rolán, B.B. Fraguela, R. Doallo. Adaptive Line Placement with the Set Balancing Cache. Proc. 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), pp. 529-540, December 2009.
[27] J. Salceda, I. Díaz, J. Touriño, R. Doallo. A Middleware Architecture for Distributed Systems Management. Journal of Parallel and Distributed Computing, 64(6):759-766, June 2004.
[28] A. Vázquez and E. Antelo. New Insights on Ling Adders. 19th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP 2008), Leuven, Belgium, pp. 233-238, 2008.
[29] A. Vázquez, E. Antelo and P. Montuschi. Improved Design of High-Performance Parallel Decimal Multipliers. IEEE Transactions on Computers, 2009 (accepted).