info24

Network of Excellence on High Performance and Embedded Architecture and Compilation | appears quarterly | October 2010

Welcome to Computing Systems Week in Barcelona, Spain, October 19-22

Message from the HiPEAC Coordinator
Message from the Project Officer
HiPEAC Activities:
- PUMPS Summer School
- HiPEAC Goes Across the Pond to DAC
HiPEAC News:
- HiPEAC Management Staff Change
- Mateo Valero Awarded Honorary Doctoral Degree by the Universidad Veracruzana, Mexico
- Tom Crick Takes Over HiPEAC Newsletter Proofreading Role
- Habilitation Thesis Defence of Sid Touati on Backend Code Optimization
- Collaboration on the Adaptation of Numerical Libraries to Graphics Processors for the Windows OS
- Lieven Eeckhout Awarded an ERC Starting Independent Researcher Grant
- ComputerSciencePodcast.com - A Computer Science Podcast
HiPEAC Announce:
- Computer Architecture Performance Evaluation Methods
- Barcelona Multicore Workshop 2010
- Handbook of Signal Processing Systems
In the Spotlight:
- FP7 EURETILE Project
- FP7 PEPPHER Project
- An Update on MILEPOST
- (Open-People) ANR Project
- 9 Million Euros for Invasive Computing
HiPEAC Start-ups
HiPEAC Students and Trip Reports
PhD News
Upcoming Events

www.HiPEAC.net

HiPEAC’11 Conference, 24-26 January 2011, Heraklion, Greece
http://www.hipeac.net/hipeac2011

Message from the HiPEAC Coordinator Koen De Bosschere

Dear friends,

I hope all of you have enjoyed relaxing summer holidays and that you are ready to tackle the challenges of the coming year. This summer, the international news was overshadowed by big items like the massive oil spill in the Gulf of Mexico and the huge flooding in Pakistan. However, on August 21, I was struck by a small footnote in the news: Earth Overshoot Day, i.e. the day on which the Earth’s population had consumed all the resources that the Earth produces in 2010. This year it was three days earlier than in 2009, and 40 days earlier than in 2000. Clearly, our impact on planet Earth is still increasing. I was wondering about the impact of computing on the planet. On the one hand, it definitely helps to preserve the environment, save energy, and make the planet a safer and better place to live. On the other hand, it clearly consumes energy, and it also seems to indirectly stimulate the accelerated consumption of natural resources by making consumer goods and services more affordable.

In June, HiPEAC underwent its second review. The reviewers concluded that we have made good progress on the recommendations of the first review, and they asked us to continue our efforts. They also suggest we start a more structural collaboration with the different Computing Systems projects in FP7 and to promote the HiPEAC roadmap more widely. To that end, we have created a small roadmap movie that you can watch at http://www.hipeac.net/roadmap.

In July, many of us enjoyed the yearly ACACES summer school in La Mola, Barcelona. Like the previous years, this flagship event of our community received a high appreciation score by the participants. I would like to thank all the instructors for their excellent performance, and I would also like to thank our colleagues from UPC and BSC for all the logistic support to make the event successful.

In October we organize our fall Computing Systems Week in Barcelona. Like in 2008, it is co-located with the Barcelona Multi-core Workshop. I hope that we will all take the opportunity to network and to start working on high quality project proposals for the 2011 Call in Computing Systems. This is the last opportunity before the deadline of the 2011 Call.

Next, there is the HiPEAC Conference, currently being prepared by our Greek colleagues in beautiful Heraklion, Crete. The conference runs for three days in January 2011, and it is preceded by a rich set of workshops and tutorials, several of which are directly linked to the HiPEAC research clusters. The whole event is expected to be a major networking event for our community.

Finally, I would like to thank Wouter De Raeve who handled the administrative and financial management of the HiPEAC project since its beginning. He decided to leave the network on August 1 for a new position within Ghent University. The HiPEAC project management has since then been taken over by Jeroen Ongenae.

Take care
Koen De Bosschere

HiPEAC News

HiPEAC Management Staff Change

As of August 1st, the HiPEAC project manager Wouter De Raeve has left the network. Wouter managed all the administrative and financial aspects of HiPEAC. He also helped with the logistics for several HiPEAC activities, such as the cluster meetings, the conference, and the summer school. Wouter has moved on to another job in the central administration of Ghent University. The HiPEAC community thanks Wouter for his excellent work and wishes him success in his new position.

Wouter has been replaced by Jeroen Ongenae. Jeroen has a Master's degree in History, and he has worked in the same department for several years. He was already familiar with HiPEAC, so the transition should go smoothly. If you have questions, do not hesitate to contact him at [email protected].

Photos: Wouter De Raeve, Jeroen Ongenae

Panos Tsarchopoulos [email protected]

Message from the Project Officer

Over the last few years, the European Commission has consistently increased the amount of funding going to research in computing architectures and tools with special emphasis on multicore computing. Tracing back to the recent past, the European Commission has funded the HiPEAC Network of Excellence which commenced in 2004. Networks of Excellence are an instrument to overcome the fragmentation of the European research landscape in a given area; their purpose is to reach a durable restructuring/shaping and integration of efforts and institutions.

In the following years, numerous European collaborative research projects in Computing have started. It is important to note here that collaborative research projects are the major route of funding in the European research landscape in a way that is quite unique worldwide. In European collaborative research projects, international consortia consisting of universities, companies and research centers are working together to advance the state of the art in a given area.

In 2006, the European Commission launched the Future and Emerging Technologies initiative in Advanced Computing Architectures as well as a number of projects covering Embedded Computing. In 2008, a new set of projects were launched to address the challenges of the multi/many core transition - in embedded, mobile and general-purpose computing - under the research headings “Computing Systems” and “Embedded Systems”; these projects were complemented by a second wave of projects that have started in 2010 under the same research headings together with a new Future and Emerging Technologies initiative on “Concurrent Tera-device Computing”.

In Call 7, which is open now, there are two relevant research objectives, one under the heading “Computing Systems” with 45 million euros funding and the other under the heading “Exascale Computing” with 25 million euros of funding; deadline for proposal submission is January 2011.

It is now crucial that the research community delivers high-impact proposals that will reinforce European excellence. We need to speed ahead on a higher gear.

Panos Tsarchopoulos

HiPEAC News

Mateo Valero Awarded Honorary Doctoral Degree by the Universidad Veracruzana, Mexico

The Universidad Veracruzana (UV), located in the Mexican state of Veracruz, conferred the honorary doctoral degree to Professor Mateo Valero, director of the Barcelona Supercomputing Center (BSC). The president of the Universitat Politècnica de Catalunya (UPC), Antoni Giró, delivered the oration for Professor Valero.

At the ceremony, held on May 28, UV also awarded honorary degrees to the Spanish stage director Manuel Montoro Tuells and expert on education and democracy, Gilberto Guevara Niebla, from Mexico.

The Universidad Veracruzana honored the significant contributions Mateo Valero has made to Computer Architecture, citing him as among the field’s top contributors of the past 25 years.

In his acceptance speech, Valero thanked the University for bestowing the honor and recalled his ties with Mexico dating back to his childhood and numerous trips to cities around the country. He also emphasized the role of research as one of the main drivers of the country’s economy. “A country must produce wealth in a sustainable manner if it is to offer a solid education and good healthcare to the entire population. The research being conducted also provides answers to the complexities facing modern societies. In short, we need to focus on and nurture our research.”

UPC president Antoni Giró, full professor at the Department of Physics and Nuclear Engineering, delivered the encomium honoring the director of the BSC. In his address, Giró praised the success achieved by Valero during his 36-year career, which has been characterized by “dedication and excellence, dedication and generosity, dedication and imagination, dedication and enthusiasm, dedication and presence, dedication and integrity, dedication and effort, effort and dedication.”

For UV, each honorary degree awarded during the ceremony recognizes an invaluable contribution to society. Therefore, the honorees join the senate of the UV.

HiPEAC Activity

PUMPS Summer School

The PUMPS 2010 Summer School, “Programming and tUning Massively Parallel Systems (PUMPS),” was offered on July 5-9, 2010 at the Universitat Politecnica de Catalunya (UPC) campus. The event was co-sponsored by the Barcelona Supercomputing Center, UPC, the University of Illinois, HiPEAC NOE and NVIDIA, with distinguished faculty members Dr. David B. Kirk of NVIDIA and Prof. Wen-mei Hwu of the University of Illinois. PUMPS attracted more than 250 applications, but could only host 100 attendees from throughout the EU and beyond, from beginners to advanced programmers and faculty. The content was based on CUDA, OpenCL, OpenMP, and StarSs programming languages and numerical methods including FFT, Graph, Tiling, Grid, Montecarlo, FDTD, Sparse matrices, etc.

The school featured lectures on cutting-edge new techniques and hands-on laboratory experience in developing applications for many-core processors with massively parallel resources such as the GPU accelerators, highlighting the new Fermi architecture. Application case studies focused on Molecular Dynamics, MRI Reconstruction, RTM Stencil, and Dense Linear Systems.

The social activities, the presentation of the GPUComputing.net community, as well as a showcase of high-end equipment by SuperMicro, AZKEN, and Bull, promoted interaction among the participants and planted the seed for future collaborations. The PUMPS “NVIDIA Best Achievement Award” Contest was launched as a continuation for the attendees to optimize four proposed challenges using what they learned at the summer school.

More information at: http://bcw.ac.upc.edu

HiPEAC Announce

Computer Architecture Performance Evaluation Methods
by Lieven Eeckhout
Morgan & Claypool Publishers, Synthesis Lectures on Computer Architecture, Ed. Mark D. Hill

Performance evaluation is at the foundation of computer architecture research and development. Contemporary microprocessors are so complex that architects cannot design systems based on intuition and simple models only. Adequate performance evaluation methods are absolutely crucial to steer the research and development in the right direction. However, rigorous performance evaluation is non-trivial as there are multiple aspects to performance evaluation, such as picking workloads, selecting an appropriate modeling or simulation approach, running the model and interpreting the results using meaningful metrics. Each of these aspects is equally important and a performance evaluation method that lacks rigor in any of these crucial aspects may lead to inaccurate performance data and may drive research and development in a wrong direction.

The goal of this book is to present an overview of the current state-of-the-art in computer architecture performance evaluation, with a special emphasis on methods for exploring processor architectures. The book focuses on fundamental concepts and ideas for obtaining accurate performance data. The book covers various topics in performance evaluation, ranging from performance metrics, to workload selection, to various modeling approaches including mechanistic and empirical modeling. And because simulation is by far the most prevalent modeling technique, more than half the book’s content is devoted to simulation. The book provides an overview of the simulation techniques in the computer designer’s toolbox, followed by various simulation acceleration techniques including sampled simulation, statistical simulation, parallel simulation and hardware-accelerated simulation.
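The announcement names sampled simulation without describing it. As a rough illustration of the general idea only (not taken from the book), the Python sketch below estimates the average cycles-per-instruction (CPI) of a run from every k-th interval instead of all intervals, and attaches a normal-approximation confidence interval. The function name `sampled_cpi_estimate`, the synthetic trace and the sampling period are all assumptions made for this example.

```python
from statistics import NormalDist, mean, stdev


def sampled_cpi_estimate(interval_cpis, period, confidence=0.95):
    """Estimate mean CPI from every `period`-th interval (systematic sampling).

    Returns (estimate, half_width), where half_width is the half-width of a
    normal-approximation confidence interval around the estimate.
    """
    sample = interval_cpis[::period]
    if len(sample) < 2:
        raise ValueError("need at least two sampled intervals")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(sample) / len(sample) ** 0.5
    return mean(sample), half_width


# Synthetic per-interval CPI values (invented for illustration, not measurements).
trace = [1.0 + 0.05 * ((7 * i) % 10) for i in range(1000)]
true_cpi = mean(trace)  # what a full, detailed simulation would report

# Simulate in detail only one interval out of seven.
estimate, half_width = sampled_cpi_estimate(trace, period=7)
```

One caveat worth keeping in mind: the synthetic trace repeats every 10 intervals, so a sampling period of 10 would always hit the same phase and bias the estimate. The period of 7 is chosen here to avoid that aliasing.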

HiPEAC Activity

HiPEAC Goes Across the Pond to DAC

DAC (Design Automation Conference) is the world’s leading technical conference and tradeshow for electronic design and design automation, truly “the show of the year”. This year’s edition was held in Anaheim in the week of June 13-18, and as usual it attracted thousands of attendees from academia and industry (official count: 5920 participants). The venue, Anaheim Convention Center, is a large modern facility that houses both the trade-show on the ground floor and the comprehensive technical conference on the floors above. The exhibition features contributions from nearly all major players in the industry. The conference boasts a significant number of excellent technical papers. Vibrant panels and user track sessions on-site, live broadcasting of FIFA World Cup games, freebies and lotteries every day, southern California weather and much more contributed to another great success of DAC.

Following the successful booth shows at the past two DATE events, HiPEAC decided to go across the pond this time to get increased international exposure and expand its impact. The booth preparations were carried out by Ghent University and RWTH Aachen University. The booth was decorated with shining posters, the latest newsletters and HiPEAC leaflets, a running presentation on an LCD stand and, for the first time in HiPEAC booth history, giveaways. We prepared boomerangs with the HiPEAC logo and URL, which turned out to be a great success and provided good traffic at DAC’s trade-show with its very commercial nature. Thanks to the careful selection of the booth location, we were close to the main aisle of the hall and a direct neighbor of the gigantic Synopsys camp, attracting hundreds of visitors during the exhibition days. We are proud to say that, as the only European project present at DAC, the booth delivered the key messages of the mission and functionalities of the HiPEAC network. It was well received and we had many new visitors, especially from the US, every day learning about the relatively new concept of NoE (Network of Excellence) and the research clusters of HiPEAC, and expressing interest in joining the HiPEAC events.

The HiPEAC booth at DAC stepped beyond its European border to extend its visibility and influence to a larger community. It is expected that through follow-up communications HiPEAC will make new friends and see many new faces in the upcoming events. It was a wonderful experience running the booth, greeting HiPEAC friends and talking to the visitors in a great summer!

Photo: Weihua Sheng and Felix Engel introducing HiPEAC to DAC attendees

HiPEAC News

Tom Crick Takes Over HiPEAC Newsletter Proofreading Role

Igor Böhm, a PhD student at Edinburgh University, who has been the HiPEAC newsletter proofreader for over a year, has recently left this role due to other commitments. Dr Tom Crick has volunteered to replace Igor. Tom is a recent member of HiPEAC and is a Lecturer in Computing at the University of Wales Institute, Cardiff (UWIC), having previously been a PhD research student and post-doctoral researcher at the University of Bath. The HiPEAC community thanks Igor for the reliable service he performed and welcomes Tom to the newsletter production team!

Photos: Igor Böhm, Tom Crick

HiPEAC News

Habilitation Thesis Defence of Sid Touati on Backend Code Optimization

Sid-Ahmed-Ali Touati

This habilitation thesis was defended on June 30th, 2010 at the University of Versailles Saint-Quentin en Yvelines (France). We start our document with a global view of the phase ordering problem in optimising compilation. Nowadays, hundreds of compilation passes and code optimisation methods exist, but nobody knows exactly how to combine and order them efficiently. Consequently, a best effort strategy consists in doing an iterative compilation by successively executing the program to decide about the passes and optimisation parameters to apply. We prove that iterative compilation does not fundamentally simplify the problem, and using static performance models remains a reasonable choice.

A well-known phase ordering dilemma between register allocation and instruction scheduling has been debated for a long time in the literature. We show how to efficiently decouple register constraints from instruction scheduling by introducing the notion of register saturation (RS). RS is the maximal register need of all the possible schedules of a data dependence graph. We provide formal methods for its efficient computation, which allow us to detect obsolete register constraints. Consequently, they can be neglected from the instruction scheduling process.

In order to guarantee the absence of spilling before instruction scheduling, we introduce the SIRA framework. It is a graph theoretical approach that bounds the maximal register need for any subsequent software pipelining, while saving instruction level parallelism. SIRA models periodic register constraints in the context of multiple register types, buffers and rotating register files. We provide an efficient heuristic that shows satisfactory results as a standalone tool, as well as an integrated compilation pass inside a real compiler.

SIRA defines a formal relationship between the number of allocated registers, the instruction level parallelism and the loop unrolling factor. We use this relationship to write an optimal algorithm that minimises the unrolling factor while saving instruction level parallelism and guaranteeing the absence of spilling. As far as we know, this is the first result in the literature proving that code size compaction and code performance are not antagonistic optimisation objectives.

The interaction between memory hierarchy and instruction level parallelism is a crucial issue if we want to hide or to tolerate load latencies. Firstly, we practically demonstrate that superscalar out-of-order processors have a performance bug in their memory disambiguation mechanism. We show that a load/store vectorisation solves this problem for regular codes. For irregular codes, we study the combination of low-level data pre-loading and prefetching, designed for embedded VLIW processors.

Finally, with the introduction of multicore processors, we observe that program execution times may be very variable in practice. In order to improve the reproducibility of the experimental results, we design the -Test, which is a rigorous statistical protocol. We rely on well known statistical tests (Shapiro-Wilk’s test, Fisher’s F-test, Student’s t-test, Kolmogorov-Smirnov’s test, Wilcoxon-Mann-Whitney’s test) to evaluate if an observed speedup of the average or the median execution time is significant.
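To give a flavour of what checking the significance of a speedup can look like, here is a minimal Python sketch of one of the tests named above, the one-sided Wilcoxon-Mann-Whitney test, applied to two sets of execution times. It is not the thesis protocol itself: the function name `mann_whitney_less`, the timings (synthetic, invented for illustration) and the use of a normal approximation without tie correction are all assumptions of this example.

```python
from statistics import NormalDist, median


def mann_whitney_less(x, y):
    """One-sided Wilcoxon-Mann-Whitney test of 'x tends to be smaller than y'.

    Returns (u, p): u counts pairs (xi, yj) with xi > yj (ties count 0.5),
    p is the lower-tail p-value from the normal approximation with a
    continuity correction. Ties receive mid-ranks; the variance is not
    tie-corrected, which is a simplification.
    """
    n, m = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n * (n + 1) / 2
    mu = n * m / 2
    sigma = (n * m * (n + m + 1) / 12) ** 0.5
    p = NormalDist().cdf((u - mu + 0.5) / sigma)
    return u, p


# Synthetic execution times in seconds (invented for illustration).
baseline = [10.9, 11.2, 10.8, 11.0, 11.4, 10.7, 11.1, 11.3]
optimised = [9.8, 9.9, 10.1, 9.7, 10.0, 9.6, 9.9, 10.2]

u, p = mann_whitney_less(optimised, baseline)
median_speedup = median(baseline) / median(optimised)
significant = p < 0.05  # 5% risk level, chosen arbitrarily for the example
```

A small p-value suggests that the optimised version is faster more often than chance would explain; with few runs per version an exact test would be preferable to the normal approximation used here.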

HiPEAC Announce

Barcelona Multicore Workshop 2010

The Barcelona Supercomputing Center, HiPEAC and Microsoft Research are proud to organize the Second Barcelona Multicore Workshop (BMW), which will take place in Barcelona on 21-22 October 2010. Building on the success of the highly productive 2008 Barcelona Multicore Workshop, BMW2010 strives to examine critically the developments in the intervening two years. The Second Barcelona Multicore Workshop (BMW) is part of the upcoming HiPEAC Barcelona Computing Systems Week.

Our goal is to bring together prominent active researchers to cross-fertilize ideas across the hardware and software communities and discuss the cutting edge in research and products: how to maximize the effectiveness of many-cores. BMW2010 will consist of a two-day workshop with invited talks, panel sessions and time for discussion.

More updated information at: http://www.bscmsrc.eu/media/events/barcelona-multicore-workshop-2010

HiPEAC News

Collaboration on the Adaptation of Numerical Libraries to Graphics Processors for the Windows OS

Prof. Gregorio Quintana, from the Department of Computer Science and Engineering of the Universidad Jaume I (UJI), will lead a group of researchers working during the next year under a contract with Microsoft, with the aim of adapting numerical libraries for graphics processors under the Windows operating system.

The High Performance Computing & Architectures (HPCA) research group of the UJI, to which 14 researchers belong, has been working for some years on the development of numerical libraries for supercomputers and, lately, for multicore processors and graphics processors.

A couple of years ago, the HPCA group devoted a specific research line to the study of graphics processors and the research of new functionalities. The work developed in this area, with the creation of optimized software for matrix computations on this kind of processor, made the group worthy of the Professor Partnership Award from NVIDIA, the leading worldwide manufacturer of graphics cards for computers.

This award and the collaboration they maintain with The University of Texas at Austin have given the group the possibility to work directly with Microsoft. The company was interested in accessing the graphics processor technology developed by the group from Castellón, which, at that point, was one of the few groups developing such technology.

The programs developed by the HPCA group are used in millions of computers in applications and software packages such as Matlab, Labview, Octave and Lapack. These programs are currently the fastest solvers for complex operations such as linear least-squares problems, numerical rank computations or eigenvalue computations. This software is used in numerous scientific and engineering applications, such as the analysis of building structures, computational physics and chemistry simulations; the use of a graphics processor yields a considerable reduction of the time-to-solution in these areas.

The members of the HPCA group are currently working on energy saving techniques in order to control the power consumption of large-scale computer installations. The group has also collaborated with organizations such as NASA to adapt numerical libraries to space vehicles; with Chicago Hospital, where they used the group’s programs for magnetic resonance to diagnose arterial blockages; with Boeing to model airplanes; and with researchers from companies including Intel, NVIDIA and ClearSpeed.

HiPEAC Announce

Handbook of Signal Processing Systems (ISBN: 978-1-4419-6344-4) By Shuvra S. Bhattacharyya, Ed F. Deprettere, Rainer Leupers, and Jarmo Takala

The Handbook of Signal Processing Systems provides a standalone, complete reference to signal processing systems organized in four parts. The first part motivates representative applications that drive and apply state-of-the-art methods for design and implementation of signal processing systems; the second part discusses architectures for implementing these applications; the third part focuses on compilers and simulation tools; and the fourth part describes models of computation and their associated design tools and methodologies. Each part has 6 - 12 chapters from leading experts.

Topics covered:
• Applications;
• Architectures;
• Programming and Simulation Tools;
• Design Methods

HiPEAC News

Lieven Eeckhout Awarded an ERC Starting Independent Researcher Grant

Contemporary microprocessors seek to improve performance through thread-level parallelism by co-executing multiple threads on a single microprocessor chip. Projections suggest that future processors will feature multiple tens to hundreds of threads, hence called many-thread processors. Many-thread processors, however, lead to non-dependable performance: co-executing threads affect each other’s performance in unpredictable ways because of resource sharing across threads. Failure to deliver dependable performance leads to missed deadlines, priority inversion, unbalanced parallel execution, etc., which will severely impact the usage model and the performance growth path for many important future and emerging application domains (e.g., media, medical, datacenter).

This project on Dependable Performance on Many-Thread Processors envisions that performance introspection, using a cycle accounting architecture that tracks per-thread performance, will be the breakthrough to delivering dependable performance in future many-thread processors. To this end, it will develop a hardware cycle accounting architecture that estimates single-thread progress during many-thread execution. The ability to track per-thread progress enables system software to deliver dependable performance by assigning hardware resources to threads depending on their relative progress. Through this cooperative hardware-software approach, this project addresses a fundamental problem in multi-threaded and multi/many-core processing. The project consists of funding for five years for three PhD students and one post-doctoral researcher.

Lieven Eeckhout is an Associate Professor at Ghent University, Belgium. His main research interests include computer architecture and the hardware/software interface in general, and performance modeling and analysis, simulation methodology and workload characterization in particular. He received two IEEE Micro Top Picks Awards and recently wrote a synthesis lecture on “Computer Architecture Performance Evaluation Methods”. He has successfully graduated six PhD students and currently supervises three post-doctoral researchers and eight PhD students. He also participates in the ExaScience Lab, part of Intel Labs Europe, focusing on architectural simulation techniques for exascale systems.

ERC Starting Grants

The European Research Council (ERC), through its ERC Starting Independent Researcher Grants, aims at supporting “up-and-coming research leaders who are about to establish or consolidate a proper research team and to start conducting independent research in Europe. The scheme targets promising researchers who have the proven potential of becoming independent research leaders. It will support the creation of excellent new research teams and will strengthen others that have been recently created.” ERC grants are considered the most competitive research grants in Europe.

HiPEAC News

ComputerSciencePodcast.com - A Computer Science Podcast

I sometimes pretend to exercise, though to be honest I don’t pretend very hard. But when I am panting and wheezing on the treadmill or cursing the hills on my bike that didn’t look so steep in the car, I like to listen to science podcasts. It takes my mind off the pain. Well, no it doesn’t really, but you know what I mean.

I have lots of these sciencey podcasts; you can find them all easily enough and some of them are pretty good. But I tell you, after listening to a few of them I find myself thinking that if I have to hear another article about global warming I shall set fire to an oil well. Don’t get me wrong, I believe in climate change and all that, but damn it, I’m a geek! Why do they never, ever say anything about computers? Is our work so dull or difficult to explain that no one will go near us? I refused to believe it!

So, I looked for a computer science podcast. Just one would have done. I looked pretty hard and asked everyone at the school about it, too. Nothing! Oh, there were a few people who’d put their undergrad lectures online and some people doing engineering and gadget podcasts, but for scientists? Nothing.

Foiled in my quest and being characteristically lazy you would expect me to call it a day, and I would have, too. Except I have this weird mental glitch that sometimes takes over and makes me do things I ordinarily wouldn’t. This glitch (which some might call ‘procrastination from writing up his PhD’) decided that if no one else was going to do it, then I would just have to suck it up and start doing it myself - I would make my own computer science podcast! I spoke to a few friends, and after the inevitable questions of “You’re actually serious, aren’t you?” had been dealt with I was amazed that they wanted to come on board, too. In fact, nobody thought this was a barmy idea at all. Before I knew it we were looking very much like we were going to be recording our first episode for a computer scientist podcast!

And it’s happened. We got together and recorded news items, some paper reviews and I even managed to wangle an interview with Stephen Cook - a Turing Award recipient in our first show! It was pretty nerve wracking in the beginning, but fortunately for us (and you when you listen to it) my next door neighbour is a freelance sound engineer who has made our stuttering disappear and, though I say it myself, by the time he’d finished with us we come across pretty professionally!

Since then, we’ve recorded another show and will be doing a third soon. It looks like this podcast, with its monthly shows, will take on a life of its own and will keep getting better and better. You’ll hear about all sorts of computer science in the show and you should find something you like. There’s always the quiz and the computer science joke of the month to entertain you. We really hope you like it, and when you do listen to it, drop us a note to let us know what you think about it, even if it’s just to say you can’t stand the sound of my voice. Enjoy listening!

You can download the podcast or subscribe to the RSS feed on www.computersciencepodcast.com. There’s also a Facebook group (“CompuCast”) where you can get all the latest information about the podcast. You can also post any suggestions, comments or feedback there, or email mail@computersciencepodcast.com. We’re always on the look out for good stories, too.

Hugh Leather

HiPEAC Students

Towards Joint Research between Augsburg and Sibiu

My name is Ralf Jahr and I am a PhD student of Prof. Theo Ungerer at the University of Augsburg, Germany. The focus of my work is on software optimizations for the Grid Alu Processor (GAP). The GAP is a novel approach to speed up the execution of sequential single-threaded code. To this end, it combines a superscalar front-end with a novel configuration unit and a coarse-grained reconfigurable grid of functional units (FUs). No special software is required to prepare a program for execution on the GAP because placement and routing are both performed in hardware. Because we do not want to lose the advantage of executing legacy programs without having access to their source code, we decided to implement target-specific code optimizations in a post-link optimizer. This optimizer is called GAPtimize and offers, for example, static speculation, function inlining and branch predication.

In August 2010 I visited Horia Calborean for three weeks. He is a PhD student of Prof. Lucian Vintan and works for Advanced Computer Architecture & Processing Systems (ACAPS) at Lucian Blaga University of Sibiu in Romania. Horia Calborean is developing a tool, called FADSE, which performs automatic design space exploration on multiobjective problems. He focuses on multi- and many-core processor simulators and his objective is to find a near Pareto-optimal set of configurations. FADSE is available for free.

During my visit we worked on bringing together FADSE and GAP/GAPtimize. To be able to do so we had to expose the parameters of GAP and GAPtimize to FADSE. We had to choose multiple objectives that would measure the quality of the configurations generated by FADSE. In this process, we created a model to approximate the hardware complexity of a configuration of the GAP. To cope with the computation power needed and to decrease the time needed to complete the high number of simulations required by DSE algorithms, we extended FADSE to run in a distributed manner on multiple computers in a network. In addition, to be able to manage network errors and errors in the software components used, we increased the robustness of FADSE on multiple levels.

In the next weeks and months we will continue our work. The goals we try to achieve are various. First, we want to show that the GAP is a scalable processor architecture in terms of hardware complexity. Second, we want to elaborate the different challenges posed to FADSE by using it for the DSE of hardware or code-optimization parameters. Furthermore, we want to show that it is able to cope with them. We also want to be able to use it for future research to find good parameters for novel code-optimizations or hardware-software co-design approaches. Third, it shall be possible to use FADSE as a platform to evaluate various DSE algorithms with the different simulators linked to FADSE and the reference problems available in it. Of course we will publish our current and future results in one or more papers.

I would like to thank the former HiPEAC Cluster on Adaptive Compilation for generously supporting my trip to Romania. It was a great opportunity to establish a collaboration that is very likely to be successful. Due to the outstanding and admirable hospitality of my hosts Prof. Lucian Vintan and Horia Calborean, Ciprian Radu, and Arpad Gellert, I was also able to learn a lot about Romania and enjoyed a very pleasant stay there.

Ralf Jahr

HiPEAC Start-ups

Ateji, Language Tools for Software Productivity

Ateji is a Paris-based software start-up that recently introduced Ateji PX, an extension of the Java language with syntactic constructs for parallel programming. With a background and unique expertise in design and technology, Ateji aims at bringing novel and innovative software solutions to the world of HPC, including multi-core, grid, cloud and GPU programming.

The Ateji PX language is built upon the mathematical foundations of the pi-calculus. This makes parallel programming simple and intuitive, close to the way we “think” parallel, in addition to being analyzable by the compiler. A handful of cleverly designed constructs make it possible to express within a single language a wide range of patterns, including data-, task-, recursive- and speculative parallelism, on shared-memory or distributed architectures, and paradigms such as data flow, stream programming, MapReduce and the Actor model (all these examples are available in the Ateji PX distribution).

Ateji PX is a pluggable language extension compatible at the source level with Java - the most popular language nowadays - with support in the state-of-the-art Eclipse IDE. Unlike libraries, preprocessors or brand new languages, Ateji’s language extension approach brings expressive power and efficiency while leveraging existing source code, development processes and developers’ training. A major goal in the design of Ateji PX is making parallel programming accessible to a much larger audience of application developers.

Efficiency (program speed) is achieved for much lower development cost and time. First, using the Java language with a recent JIT (just-in-time) compiler provides performance similar to C/C++, with code generated and optimized on-the-fly for a specific hardware configuration, while avoiding the memory allocation bugs and multi-target compilation nightmare associated with low-level languages. For this reason, Java has been adopted for the handling of huge data volumes in scientific data centers such as ESA and CERN.

Second, while the Java threading model remains a very low-level abstraction, parallelizing Java code with Ateji PX often amounts to inserting only a couple of parallel bars (“ || ”) in the source. For the matrix multiplication example, a speedup of 12.5x on a 16-core server was achieved with the single addition of a parallel bar in the source code. Ateji PX performs most of its work at compile-time and incurs practically no overhead at run-time.

[Figure: Matrix multiplication example with the parallel bars]

Ateji is currently developing implementations of the Ateji PX language for distributed architectures (grids, clusters and cloud) where message passing is integrated at the language level and mapped by the compiler on any chosen infrastructure, such as sockets or MPI. A version dedicated to GPU accelerators is also in preparation. Because the compiler understands parallel composition and message-passing, the same source code can be ported to very different architectures without changes.

Ateji PX has been selected for presentation in the Disruptive Technologies exhibit at SC’10 in New Orleans.

Whitepapers, online demos and evaluation licenses are available at http://www.ateji.com/px
Academic licenses are free.

Patrick Viry
[email protected]
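To put the reported matrix multiplication figure (12.5x on 16 cores) in perspective, the short Python sketch below computes the corresponding parallel efficiency and, assuming Amdahl's law applies, the serial fraction that such a speedup would imply. Amdahl's law is an assumption made here for illustration; the article itself does not analyse where the remaining time goes, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of the ideal (linear) speedup that was actually obtained."""
    return speedup / cores


def amdahl_serial_fraction(speedup, cores):
    """Serial fraction s implied by Amdahl's law.

    Amdahl's law: speedup = 1 / (s + (1 - s) / cores)
    Solving for s: s = (cores / speedup - 1) / (cores - 1)
    """
    return (cores / speedup - 1.0) / (cores - 1.0)


def amdahl_speedup(serial_fraction, cores):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Figures quoted in the article: 12.5x speedup on a 16-core server.
efficiency = parallel_efficiency(12.5, 16)
serial = amdahl_serial_fraction(12.5, 16)

print(f"parallel efficiency: {efficiency:.1%}")
print(f"implied serial fraction (Amdahl): {serial:.2%}")
```

Under this simplified model the quoted result corresponds to roughly 78% parallel efficiency and a serial fraction of just under 2%.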

In the Spotlight

FP7 EURETILE Project: EUropean REference TILed architecture Experiment

Coordinator: Pier Stanislao Paolucci, Istituto Nazionale di Fisica Nucleare (INFN)
Website: www.euretile.eu
Partners: INFN (IT), ETH Zurich (CH), RWTH Aachen University (DE), UJF-TIMA (FR), TARGET (BE)
Duration: 3 years

EURETILE investigates and implements brain-inspired foundational innovations to the system architecture of massively-parallel tiled computer architectures and the corresponding programming paradigm. The execution target is a fault-tolerant many-tile HW platform, equipped with a many-tile simulator. A set of SW process - HW tile mapping candidates is generated by the holistic SW tool-chain using a combination of analytic and bio-inspired methods. The Hardware-dependent Software is then generated, providing OS services with maximum efficiency/minimal overhead. The many-tile simulator collects profiling data, closing the loop of the SW tool chain. Fine-grain parallelism inside processes is exploited by optimized intra-tile compilation techniques, but the project focus is above the level of the elementary tile. The elementary HW tile is a multi-processor, which includes a fault tolerant Distributed Network Processor (for inter-tile communication), a floating-point numerical engine (for computations) and a RISC processor (for control, user interface and sequential computations). Furthermore, EURETILE investigates and implements the innovations for equipping the elementary HW tile with high-bandwidth, low-latency brain-like inter-tile communication (emulating 3 levels of connection hierarchy, namely neural columns, cortical areas and cortex).

[Figure: Proposed tiled-architecture]

EURETILE leverages on the working SW and HW prototypes of the innovative multi-tile HW paradigm and SW tool-chain developed by the FET-ACA SHAPES Integrated Project (2006-2009). This background knowledge includes working tile silicon and board, a multi-tile simulator (running up to eight tiles), and a complete SW tool-chain including a parallel programming and an automatic mapping/optimization environment (Distributed Operation Layer), a specialized OS (DNA-OS automatically generated for both RISC and VLIW floating-point numerical processor) integrated with Linux RT, and an optimizing compiler co-designed with the floating-point engine.

INFN (Istituto Nazionale di Fisica Nucleare) is the coordinator of the project. The APE Parallel Computing Lab of INFN Roma is in charge of the EURETILE HW Architecture and Scientific Application Benchmarks. The Computer Engineering and Networks Laboratory (TIK) of ETH Zurich (Swiss Federal Institute of Technology) designs the high-level explicit parallel programming and automatic mapping tool (DOL/DAL). The Software for Systems on Silicon (SSS) group of the ISS institute of RWTH Aachen University investigates and provides the parallel simulation technology. The TIMA Laboratory of the University Joseph Fourier in Grenoble explores and deploys the HdS (Hardware dependent Software) including the distributed OS architecture. TARGET Compiler Technologies, the Belgian leading provider of retargetable software tools for the design, programming, and verification of application-specific processors (ASIPs), is in charge of the optimizing C compiler for custom components of the EURETILE architecture.

Project Coordinator
Pier Stanislao Paolucci
Istituto Nazionale di Fisica Nucleare, Roma
[email protected]


FP7 PEPPHER Project: Performance Portability and Programmability for Heterogeneous Manycore Architectures

Coordinator: Sabri Pllana, University of Vienna (AT)
Website: www.peppher.eu
Partners: University of Vienna (AT), Chalmers University (SE), Codeplay Software Ltd. (UK), INRIA (FR), Intel GmbH (DE), Linköping University (SE), Movidius Ltd. (IE), Karlsruhe Institute of Technology (DE)
Duration: 3 years

The pervasiveness of multi-core processors in Europe impacts on the whole spectrum of systems from embedded and general-purpose to high-end computing systems. However, leveraging the inherent performance of such multi-core systems, in particular heterogeneous many-core systems, with a reasonable effort is a challenging task.

[Figure: Consortium members]

The three-year European FP7 project PEPPHER, which started in January 2010, addresses the vital issue of maintaining application performance portability in today’s different, highly complex, often heterogeneous parallel many-core architectures. This means that the necessary adaptations of an application to ensure good absolute performance or preserving specified performance requirements when porting from one, possibly heterogeneous many-core architecture, to another are performed automatically by static and dynamic reorganizations of programmer identified components of the code. PEPPHER addresses this challenge by a component annotation and coordination language with which the programmer identifies components of the application, specifies their interaction and performance requirements, possibly by relating to abstract machine models stored in a PEPPHER repository. PEPPHER assumes that these components may already have been parallelized using a conventional language or framework.

A major challenge in the project is to make this parallelization-agnostic approach work with different existing parallelization models and languages. The coordination of components enables auto-tuning at a high level of abstraction by finding the optimal composition of the identified components. The static optimization of more or less coarse-grained components is complemented by an efficient, heterogeneous run-time scheduling system that is able to select the most promising implementation of a component and schedule this among the immediately available processing elements, based on performance information derived from the component annotations, and/or from the history based performance information. The static and dynamic optimization possibilities exposed through the annotation language that permit transformation and compilation into component variants, are complemented by compilation techniques and efficient adaptive algorithms library support. For special, commonly used, or performance-critical components, a framework for library support of auto-tuned algorithms that can adapt to different types of many-core architectures is likewise being developed. Finally, PEPPHER is investigating hardware support for performance portability and programmability based on a simulator developed in the project.

Coordinated by the University of Vienna, PEPPHER unites forces of the Universities of Chalmers and Linköping, Karlsruhe Institute of Technology, an INRIA research centre, European SMEs Codeplay and Movidius, and one of Intel’s European labs.

Project Coordinator
Prof. Sabri Pllana,
University of Vienna, Austria
[email protected]
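The run-time selection step described in the PEPPHER article (choosing the most promising implementation variant of a component among the processing elements that are currently available, using history-based performance information) can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the class `VariantSelector`, the variant names and the timing values are invented for illustration and do not reflect PEPPHER's actual annotation language or run-time system.

```python
from collections import defaultdict
from statistics import mean


class VariantSelector:
    """Hypothetical history-based variant selector (not the PEPPHER API)."""

    def __init__(self):
        # (variant name, device kind) -> list of observed runtimes in seconds
        self._history = defaultdict(list)

    def record(self, variant, device, seconds):
        """Store one observed execution time for a variant on a device."""
        self._history[(variant, device)].append(seconds)

    def select(self, variants, available_devices):
        """Pick the variant with the lowest mean observed runtime.

        variants: dict mapping variant name -> device kind it targets.
        available_devices: set of device kinds that are free right now.
        Variants without history get an estimate of 0.0 so that they are
        tried at least once (a simple exploration rule, assumed here).
        """
        best = None
        for name, device in variants.items():
            if device not in available_devices:
                continue
            runs = self._history.get((name, device))
            estimate = mean(runs) if runs else 0.0
            if best is None or estimate < best[0]:
                best = (estimate, name, device)
        if best is None:
            raise RuntimeError("no variant can run on the available devices")
        return best[1], best[2]


# Example: one component with a CPU variant and a GPU variant.
selector = VariantSelector()
variants = {"sort_cpu": "cpu", "sort_gpu": "gpu"}
selector.record("sort_cpu", "cpu", 0.40)
selector.record("sort_cpu", "cpu", 0.44)
selector.record("sort_gpu", "gpu", 0.10)

choice_all = selector.select(variants, {"cpu", "gpu"})
choice_cpu_only = selector.select(variants, {"cpu"})
print(choice_all, choice_cpu_only)
```

When both devices are free the faster GPU variant is chosen; when only the CPU is available the selector falls back to the CPU variant, which is the kind of adaptation to "immediately available processing elements" the article describes.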


An Update on MILEPOST

The original MILEPOST project, a collaboration supported by HiPEAC, finished last year. This is an update on work since then.

The Challenge

Modern compilers offer hundreds of possible optimization passes but not all optimizations work well on all code. Selecting the best optimization setting is a huge challenge to the programmer.

The solution is iterative compilation. The user repeatedly compiles their program, trying different sets of optimization options and finally selecting the best set. This can sometimes double performance, but is too laborious for everyday use.

What is MILEPOST?

MILEPOST “learns” the characteristics of programs and associates them with the best set of optimization passes. The infrastructure provides a database to hold information “learned” and uses this to select the optimization passes to be used with any new program.

The database is trained using a representative set of programs, which are compiled many times with different options and their performance measured. Standard machine learning techniques are used to construct a probabilistic model, from which predictions for new (unseen) programs can be made.

The approach is designed to be generic, but has initially been implemented for GCC, with MILEPOST GCC 4.4 released in June 2009. Early results showed that average performance benefits ranged from 11% to 40% for the different processors, with the best programs more than doubling in speed. That is comparable with the results from iterative compilation, but with a compiler that is as easy and quick to use as standard GCC.

Collective Tuning and Recent Work on MILEPOST

Since the completion of the original MILEPOST project, the Collective Tuning website (www.ctuning.org) has acted as the repository for the technology.

MILEPOST was developed as a generic solution, although the first released version was for GCC. For MILEPOST to be widely adopted, we need it to use standard features of compilers such as GCC. This will involve both changes to MILEPOST and changes to standard GCC.

Over the past year, Jörn Rennecke of Embecosm has been working with INRIA to update MILEPOST, supported by HiPEAC. Rather than use GCC plugins, we have adopted the GCC target hook infrastructure, which is more suited to MILEPOST. At present adding a new target hook is a very laborious task, so GCC needs to be updated to make target hook implementation simpler.

Jörn Rennecke achieved two goals during the past year:
• MILEPOST GCC was updated to GCC 4.5.
• The basic improvements to the target hook infrastructure were accepted into GCC mainline.

The Future: MILEPOST II

The target hook improvements to GCC are only a first step. MILEPOST also needs a robust multi-target infrastructure in GCC. Having achieved these changes to GCC, we must modify MILEPOST to use them.

MILEPOST also needs comprehensive tutorial and porting documentation, so it can easily be installed using standard tools, ported to new architectures and run with private databases.

We have outlined a new three phase project to achieve these goals.
1. Phase One will develop the new infrastructure needed in GCC and modify MILEPOST to use it over six months.
2. Phase Two will push those changes through the GCC adoption process, a part-time effort of up to two years.
3. Phase Three will run concurrently with Phases One and Two, making MILEPOST easier to use, and demonstrating its portability on partners’ architectures.

Embecosm is able to provide the resource to do the work, and will underwrite a proportion of the cost of that resource. However we need a lead technology partner, with an investment of up to €70k over the 2-3 year lifespan of the project and other technology partners with an investment of up to €30k during Phase Three of the project.

We are in discussions with HiPEAC and INRIA over matched funding to reduce the contribution needed from the partners.

A formal announcement about this project will follow shortly. In the meantime, organizations may express their interest by contacting Dr Jeremy Bennett at Embecosm, jeremy.ben-[email protected].

A more detailed version of this report is available on Embecosm’s website at www.embecosm.com/articles/ear4/milepost-update.pdf.

Jeremy Bennett
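The two ideas in the MILEPOST article, iterative compilation (try several option sets, keep the best) and learning from a trained database to predict options for an unseen program, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: MILEPOST builds a probabilistic model, whereas the sketch below substitutes a much simpler nearest-neighbour lookup; the feature vectors, the `TRAINING` table and the timing numbers are invented, and `measure` stands in for an actual compile-and-run step.

```python
import math


def iterative_compilation(candidates, measure):
    """Try every candidate option set and keep the fastest one.

    candidates: iterable of option tuples.
    measure: callable returning the run time for an option tuple
             (hypothetical stand-in for compiling and timing the program).
    """
    return min(candidates, key=measure)


# Hypothetical training database: program feature vector -> best options
# found by iterative compilation on that training program.
TRAINING = [
    ((120.0, 0.30, 4.0), ("-O2", "-funroll-loops")),
    ((15.0, 0.05, 1.0), ("-Os",)),
    ((400.0, 0.60, 9.0), ("-O3", "-ftree-vectorize")),
]


def predict_flags(features, training=TRAINING):
    """Return the options of the most similar training program (1-NN).

    A simplified stand-in for MILEPOST's probabilistic model; features
    are compared with plain Euclidean distance and are not normalised.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    _, flags = min(training, key=lambda entry: distance(entry[0], features))
    return flags


# Iterative compilation with made-up timings (seconds per option set).
fake_timings = {("-O1",): 2.0, ("-O2",): 1.4, ("-O3",): 1.6}
best_options = iterative_compilation(fake_timings, fake_timings.get)

# Prediction for an unseen program whose features resemble the first entry.
predicted = predict_flags((110.0, 0.28, 5.0))
print(best_options, predicted)
```

The contrast between the two functions mirrors the article's point: `iterative_compilation` needs one measurement per option set for every program, while `predict_flags` answers immediately once the database has been trained.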

HiPEAC Start-ups

Parallelizing C Code for Embedded Systems

The need for more speed with less power means we are no longer increasing clock frequencies. Instead, we are keeping frequencies constant, but increasing the amount of things that can be done at the same time. It is all about going parallel, especially within embedded systems.

Parallelizing C programs is particularly challenging because pointers are allowed to run rampant, creating data dependencies that cannot be analyzed statically. Whatever C’s faults, however, the fact remains that C is extremely prevalent in embedded systems and shows no signs of going away. The problem cries out for tools to help make sure that multi-threaded programs work correctly.

vfAnalyst is the first of a series of tools created by startup Vector Fabrics to address this challenge. This tool, through a combination of static and dynamic analysis, identifies and characterizes the dependencies inherent in a program. This allows two key possibilities: you can use the insights gained to change the program to make it more amenable to parallelization, and you can make decisions as to how best to parallelize the program.

[Figure: 3D View of the program structure (upper part) and data dependences (bottom): red dependencies create a serious challenge for partitioning; green ones can be managed by adding streaming channels after partitioning]

In the first case, if a program is poorly designed, it will have numerous intertwining data dependencies, making it difficult to find places where you can pull it apart. Restructuring the code or the algorithms will allow you to massage the program into something more manageable.

Given a program that can be parallelized, there are typically many ways to do that. vfAnalyst works not only by displaying the dependencies, which give visual cues as to where to break the code, but also by showing the cost and benefit of various loop parallelization strategies. The key here is that vfAnalyst is not a magic intelligent tool that can figure out how best to parallelize the program on its own; no tool can do that. Instead, it gathers information that is very hard to obtain by hand so that you can make the decisions.

Because this is a cloud-hosted tool accessed through a Flash browser application, you can view your program in some novel ways. In one view, you see a kind of 3D-looking display of the functions and loops in the program.

Another view shows an ASAP (or eager) schedule for code that has been parallelized. You can explicitly see the data dependencies as they go back and forth between the different threads.

[Figure: Application scheduling view]

This visibility into your program (abstracting out lots of unnecessary detail and making the code accessible even if you did not write it and do not know it) has obvious utility for developers trying to parallelize existing sequential code and for re-optimizing programs that have already been parallelized.

But it is also useful to architects who are simply playing with high-level functions that have not been written yet. The basic data interactions can be sketched and run, even with dummy algorithms, to determine the most efficient way to assemble a parallel system while minimizing overhead for moving data between threads.

Vector Fabrics will be building on this basic technology as new capabilities are added. The goal is to reduce the time and work it takes to parallelize code and to make the process less error-prone. By focusing developers on what they do best (evaluating options and making decisions) the hope is that engineers get to spend more time doing the interesting part of the work even as they get their projects done in less time and with higher quality.

You can try out vfAnalyst at www.vectorfabrics.com.

Bryon Moyer
[email protected]
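The ASAP (eager) schedule mentioned in the vfAnalyst article starts every task as soon as all the tasks it depends on have finished. The Python sketch below computes such a schedule for a small, invented task graph. It assumes an unlimited number of processing elements and ignores communication cost; the task names and durations are hypothetical, and nothing here reflects how vfAnalyst itself is implemented.

```python
def asap_schedule(durations, deps):
    """Compute the earliest start time of each task (ASAP schedule).

    durations: dict mapping task -> execution time.
    deps: dict mapping task -> list of tasks it depends on.
    Assumes unlimited processing elements, so a task starts as soon as
    its last predecessor finishes.
    """
    start = {}

    def earliest_start(task, visiting=()):
        if task in start:
            return start[task]
        if task in visiting:
            raise ValueError(f"dependency cycle through {task!r}")
        finish_times = (
            earliest_start(pred, visiting + (task,)) + durations[pred]
            for pred in deps.get(task, [])
        )
        # A task without predecessors starts at time 0.
        start[task] = max(finish_times, default=0)
        return start[task]

    for task in durations:
        earliest_start(task)
    return start


# Hypothetical task graph: "filter" and "fft" both consume the output of
# "read" and can run in parallel; "merge" needs both results.
durations = {"read": 2, "filter": 3, "fft": 4, "merge": 1}
deps = {"filter": ["read"], "fft": ["read"], "merge": ["filter", "fft"]}

schedule = asap_schedule(durations, deps)
makespan = max(schedule[t] + durations[t] for t in durations)
print(schedule, makespan)
```

In this example the two independent tasks start together at time 2, and the final task has to wait for the slower of the two, which is exactly the kind of dependency a scheduling view makes visible.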

HiPEAC Students

ACACES 2010 Report

Vivek Sarkar during his lecture at ACACES

This summer I was lucky to spend one week at ACACES, the HiPEAC summer school that was held at the La Mola conference center some 30 km North-West from Barcelona, Spain.

To set the tone for this report, I say that ACACES stands for learning, socialization, and relaxation.

I decided to break the dependency between my flights and the ACACES buses. Also, I wanted to explore beaches in Barcelona. So I stayed a couple of nights in Barcelona before and after the summer school. In other words, ACACES was a very good reason to pay a visit to Barcelona!

The conference centre was located in the middle of nowhere, and it was rather hot outside all the time, so everybody stayed focused on classes and socialization. Conveniently, the conference center was all ours.

I liked the architecture and design of the venue and rooms. They could have easily been the focus of an article in the Wallpaper magazine.

The classes were excellent. You can look up the website of ACACES 2010 for a detailed list and description of them. I personally enjoyed those I attended: Multicore Programming Models and Their Compilation Challenges, Variation-Aware Processor Design, FPGA-Based Reconfigurable Computing, and How to Transform Research Results Into a Business. What made the summer school special was the style of presenting the information. It was easy to comprehend the basics given on the first day and then follow the line throughout the week. Obviously, decent attention was required, just like in any learning process. Occasional questions from the lecturers made it even more interesting and involving. The handouts were handy, the classrooms were nice, the chairs were comfortable. Moreover, shortly after the summer school, all the materials were made available for download.

To maintain the sugar and caffeine levels throughout the day, lavish coffee breaks were provided.

Another thing that I really appreciated was the abundance of free still water: there were refrigerators on each floor, always stuffed with bottles.

In addition to the classes, there was a poster session. I did not present a poster, so had the freedom and time to move around and interrogate others. That was extremely interesting and informative.

There were two official evening outings to the city during the week. I did not attend them; I preferred the dinners at the summer school. I enjoyed the privacy of those two evenings and exceptionally nice chats with others who stayed in.

Here I have to explicitly say that the meals were awesome. I got all the fruits, meat, wine and cakes I needed to tick the checkbox against “food” on my satisfaction list. As far as I am concerned, the vegetarian menu was good too.

To burn excessive calories, I went to the on-site gym and refreshing outdoor swimming pool.

Although swimming was officially not allowed after 8pm, the final evening of the summer school was an exception: a pool party was thrown after the closing barbecue. That was a perfect time for pictures and networking without business cards.

To conclude, ACACES was a great experience during the great summer of 2010. I learned a lot, met many sharp people, and recharged my solar batteries.

It was my first HiPEAC summer school; the organizers set a high standard. I look forward to coming back next year and am sure that my expectations will be met.

Dmitry Knyaginin

In the Spotlight

(Open-People) ANR Project Open Power and Energy Optimization Platform and Estimator

Website: www.open-people.fr
Partners: Lab-STICC (University of South Brittany); Cairn-INRIA Rennes Bretagne Atlantique (University of Rennes 1, Lannion); INRIA-Lille (University of Valenciennes); INRIA-Nancy, LEAT (University of Nice); Thales (Colombes); InPixal (Rennes)

[Figure: Global architecture for the Open-People platform]

Designing low power complex electronic devices is now a key challenge for corporations in a number of electronic domains. The motivations that lead designers to consider low power designs are multiple: increase lifetime, increase use time without recharge of battery, limit battery capacity, limit temperature, etc. Unfortunately, there are currently neither methodologies nor tools available to obtain an accurate estimation of consumption of a complete system at different levels of description, i.e. at different design steps.

The Open-People project aims at providing a complete platform
• to allow rapid power/energy estimation for complex heterogeneous systems,
• to test different optimizations in order to significantly reduce the power consumption of the system.

The goal of the Open-People platform is to provide access to hardware execution boards (processor, DSP, FPGA, logic analyzers, etc.) and the control of power estimations and optimizations.

Through a secured web portal, the platform provides access to the power measurements and helps the designer to define models of energy consumption for hardware and software components of a complete system. These models can be included in the component library in order to be used for estimation and optimization design steps. At the end of the Open-People project, the platform will propose a set of optimization tools at different levels of description and/or for the different target boards (architectural optimizations, operating system optimizations, etc).

Various research and development works are currently under way in the Open-People project. These works include the definition of new methods and tools to model the different components of a heterogeneous system architecture: processors, hardware accelerators, memories, reconfigurable circuits, operating system services, IP blocks, etc. For reconfigurable systems, the dynamic reconfiguration paradigm will be modeled to estimate how this feature can be used by the operating system to reduce the energy consumption. Furthermore, this project studies how the complete estimation and validation can be ensured for very complex systems in a small simulation time.

The software platform can be used in standalone or connected mode, to drive estimation and optimization, and is coupled to an automated hardware platform for physical measurements. This hardware platform is also accessible through an Internet portal, making it possible to remotely carry out the measurements needed to build models for new components. In addition, a library of benchmarks will be proposed, to help build models for new components and architectures.

The project is open to new collaborators who are interested in using the platform or in contributing power consumption component models or methodologies and tools for estimation and/or optimization.

The Open-People project is funded by the French ANR framework (2009-2012).

Daniel Chillet
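As an illustration of how per-component energy models from a component library could be combined into a system-level estimate, the following Python sketch multiplies an assumed power figure for each component state by the time spent in that state and sums the results. The component names, states and power values are invented for this example; they are not measurements from the Open-People platform, and the real models are likely to be considerably richer.

```python
# Hypothetical component library: power draw in milliwatts per state.
COMPONENT_POWER_MW = {
    "cpu": {"active": 250.0, "idle": 20.0},
    "sdram": {"active": 100.0, "idle": 5.0},
}


def estimate_energy_mj(activity, library=COMPONENT_POWER_MW):
    """Estimate total energy in millijoules.

    activity: list of (component, state, seconds) tuples describing how
    long each component stays in each state.
    Energy = power * time, and mW * s gives mJ.
    """
    return sum(library[component][state] * seconds
               for component, state, seconds in activity)


# Example scenario: the CPU computes for 2 s and idles for 8 s, while the
# memory is active for 1 s.
scenario = [
    ("cpu", "active", 2.0),
    ("cpu", "idle", 8.0),
    ("sdram", "active", 1.0),
]
total_mj = estimate_energy_mj(scenario)
print(f"estimated energy: {total_mj:.1f} mJ")
```

A model of this shape makes it easy to compare optimizations: changing the time a component spends in a low-power state immediately changes the estimate, without rerunning a physical measurement.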


9 Million Euros for Invasive Computing

Coordinator: Professor Dr.-Ing. Jürgen Teich, University of Erlangen-Nuremberg
Website: www.invasic.de
Partners: University of Erlangen-Nuremberg, Karlsruhe Institute of Technology, Technische Universität München
Duration: 4 years
Start: July 1, 2010

In July 2010, the German Research Foundation (DFG) established the new Transregional Collaborative Research Centre (TCRC) 89, “Invasive Computing”, coordinated by the speaker Jürgen Teich, head of the Chair of Hardware/Software Co-Design at the Friedrich-Alexander-University Erlangen-Nuremberg. In collaboration with colleagues from the Karlsruhe Institute of Technology and the Technische Universität München, the researchers aim to find new ways to design and to program novel parallel computing systems. In a first phase, TCRC is initially funded for four years and is supported with around nine million euros in this time.

“This is a fantastic success for all participating scientists”, Jürgen Teich commented on the decision of the DFG. “The TCRC gives us the opportunity to shape cutting-edge research in the field of new multi-processor technology and programming methods.”

At TCRC, we intend to investigate a novel paradigm for designing and programming future parallel computing systems, called “Invasive Computing”. Within the next few years, multi-core computers will have hundreds or even thousands of processor cores integrated in a single chip. The main idea and novelty of invasive computing is to introduce resource-aware programming support by giving programs the ability to explore and dynamically spread their computations to neighbour processors in a phase called invasion, then to execute portions of code in parallel based on the available (invasible) region on a given, heterogeneous multi-processor architecture. Afterwards, once the program terminates or if the degree of parallelism should be decreased, the program may enter a “retreat” phase, deallocate resources and continue execution, for example, sequentially on a single processor. In order to support this idea of self-adaptive and resource-aware programming, not only new programming concepts, languages, compilers and operating systems are necessary but also revolutionary architectural changes in the design of MPSoCs (Multi-Processor Systems-on-a-Chip) have to be provided in order to efficiently support invasion, infection and retreat operations involving concepts for dynamic processor, interconnect and memory reconfiguration.

This process should be fully automated and it is our goal to keep it as flexible as possible.

Additionally, it is our policy that software is able to adapt to hardware and vice versa. In this way, the computing systems can be designed to operate much more efficiently, particularly in terms of their energy budget and their computing power.

Project Coordinator
Prof. Jürgen Teich,
University of Erlangen-Nuremberg, Germany
[email protected]

PhD News

Power Modeling of Embedded Systems and Wireless Sensor Networks

By Daniel Schmidt ([email protected])
Advisor: Prof. Dr. Norbert Wehn
TU Kaiserslautern, Germany
July 2010

The technological progress of the past decade has triggered a tremendous growth of the market of mobile embedded systems. This market poses numerous demands, especially small size, very low cost, and long battery life time. Hence, energy efficiency is a key design challenge and design optimization for this goal has to be part of all system design phases. This thesis deals with the creation of accurate power models to be used in early design phases for different kinds of embedded systems. In a purely measurement-based approach we show the need for system simulation and optimization and identify many of the common assumptions of power consumption based on theoretical models as overly simplified and optimistic.

We present XEEMU, a power simulator for the Intel XScale, a typical representative of an embedded systems microprocessor. The power model and behavioral model are based on runtime and power measurements on a test platform that allows dynamic voltage scaling and SDRAM power management. We compare XEEMU to state-of-the-art simulators and assess aggressive SDRAM power management strategies. We show that previously published SDRAM power models are inaccurate and propose a new model, which is integrated into XEEMU. XEEMU proves to be the most accurate simulator and, to the best of our knowledge, the only system simulator including a precise SDRAM power model.

Wireless sensor networks (WSN) are the enabling technology for e.g. Ambient Intelligence (AmI) systems. While early research always promoted reducing the transmission power and range of the sensor nodes, we show that the energy consumption can be effectively reduced by the application of forward error correcting (FEC) channel codes. This hypothesis is based on a detailed analysis of the power consumption of typical nodes and application scenarios. We are also the first to give experimental data on the efficiency of FEC codes in WSNs.

Architectural Support For Parallel Computers with Fair Reader/Writer Synchronization

By Enrique Vallejo ([email protected])
Advisors: Dr. Ramon Beivide and Dr. Fernando Vallejo
University of Cantabria, Spain
June 2010

Technological evolution in microprocessor design has led to parallel systems with multiple execution threads. These systems are more difficult to program and present higher performance overheads than traditional uniprocessor systems, which may limit their performance and scalability. These overheads are due to the synchronization, coherence, consistency and other mechanisms required to guarantee a correct execution.

Parallel systems require a deeper knowledge of the system from the programmer in order to achieve good performance and [...]. Traditional parallel programming has been based on synchronization primitives such as barriers, critical sections and reader/writer locks, which are highly prone to programming errors. Transactional Memory (TM) is a relatively recent proposal that seeks to remove the synchronization problems from the programmer. However, many TM systems still rely on reader/writer locks, and would benefit from an efficient implementation.

This thesis presents new hardware techniques to accelerate the execution of such parallel programs. We propose a Hybrid TM system based on reader/writer locks, which minimizes the software overheads when acceleration hardware is present, but still allows for correct software-only execution. The fairness of the system is studied, and a mechanism to guarantee fairness between hardware and software transactions is provided. We introduce a low-cost distributed mechanism named the Lock Control Unit to handle fine-grain reader/writer locks. Finally, we propose an organization of a parallel architecture based on Kilo-Instruction Processors, which helps to simplify the [...] while allowing for high performance thanks to the speculative large instruction window.

Reducing Memory Latency by Improving Resource Utilization

By Marius Grannaes ([email protected])
Advisor: Prof. Lasse Natvig
Norwegian University of Science and Technology, Norway
June 2010

Integrated circuits have been in constant progression since the first prototype in 1958, with the industry maintaining a constant rate of miniaturisation of transistors and wires. Up until about the year 2002, processor performance increased by about 55% per year. Since then, limitations on power, ILP and memory latency have slowed the increase in uniprocessor performance to about 20% per year. Although the capacity of DRAM increases by about 40% per year, the latency only decreases by about 6-7% per year. This performance gap between the processor and DRAM leads to a problem known as the memory wall.

The main goal of this thesis is to improve system memory latency by leveraging available resources with excess capacity. The interaction between the controller and the prefetcher is especially important, because of the complex 3D structure of modern DRAM. Utilizing open pages can increase the performance of the system significantly. Memory controllers can increase bandwidth utilization and reduce latency at the same time by scheduling prefetches such that the number of page hits is maximized. In addition, when memory subsystem resources are shared, one core might interfere with another core's execution by delaying memory requests or displacing useful data in the cache.

Prefetching predicts what data is needed in the future and fetches that data into the cache before it is referenced. This dissertation presents a technique for generating highly accurate prefetches with good timeliness called DCPT. DCPT uses a table indexed by the load's address to store the delta history of individual loads. Delta correlation is then used to predict future misses. DCPT-P extends DCPT by introducing L1 hoisting, which moves data from the L2 to the L1 to further increase performance. In addition, DCPT-P leverages partial matching, which reduces the spatial resolution of deltas to expose more patterns.
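The delta-correlation idea summarised above can be illustrated with a small sketch. This is not the thesis implementation: the class name, the history depth, the choice to replay the most recent matching pair of deltas, and the reading of "the load's address" as the address of the load instruction are all assumptions made for illustration, and the sketch omits prefetch filtering, L1 hoisting and partial matching.

```python
from collections import deque


class DeltaCorrelatingPrefetcher:
    """Toy per-load delta-correlation predictor (illustrative only)."""

    def __init__(self, history_len=8):
        # history_len is a hypothetical parameter: deltas remembered per load.
        self.history_len = history_len
        # Table indexed by the load instruction's address (assumed meaning of
        # "the load's address"); each entry keeps the last miss address and
        # the recent address deltas of that load.
        self.table = {}

    def access(self, pc, addr):
        """Record a miss at `addr` by the load at `pc`; return predicted addresses."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = {
                "last_addr": addr,
                "deltas": deque(maxlen=self.history_len),
            }
            return []
        delta = addr - entry["last_addr"]
        if delta == 0:
            return []
        entry["last_addr"] = addr
        entry["deltas"].append(delta)
        return self._correlate(addr, list(entry["deltas"]))

    @staticmethod
    def _correlate(addr, deltas):
        """Find an earlier occurrence of the two newest deltas and replay what followed."""
        pair = deltas[-2:]
        # Walk backwards so the most recent earlier occurrence wins.
        for i in range(len(deltas) - 2, 0, -1):
            if deltas[i - 1:i + 1] == pair:
                predictions = []
                next_addr = addr
                for d in deltas[i + 1:]:
                    next_addr += d
                    predictions.append(next_addr)
                return predictions
        return []


if __name__ == "__main__":
    prefetcher = DeltaCorrelatingPrefetcher()
    for address in (10, 11, 20, 21, 30):
        predicted = prefetcher.access(0x400, address)
    print(predicted)  # [31, 40]
```

In the example, the misses at 10, 11, 20, 21 and 30 produce the deltas 1, 9, 1, 9. The newest pair (1, 9) also appears at the start of the history, so the deltas that followed it (1, then 9) are replayed from address 30, giving the candidates 31 and 40.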


On the Programmability of Heterogeneous Massively-Parallel Computing Systems

By Isaac Gelado ([email protected])
Advisors: Prof. Dr. Nacho Navarro and Prof. Dr. Wen-mei W. Hwu
Universitat Politecnica de Catalunya
July 2010

This thesis deals with the programmability problems due to separate physical memories for CPUs and accelerators, which are key to achieving high performance. These separate memories are presented to application programmers as disjoint virtual address spaces, which harms programmability in two different ways. First, extra code is needed to transfer data between system and accelerator memories. Second, data structures are double-allocated in both memories, which results in using different memory addresses (pointers) in CPU and accelerator code to reference the same data structure.

This thesis proposes a programming model that integrates accelerator and system memories into a single virtual address space, allowing the CPU code to access accelerator memories using regular load/store instructions. Moreover, because a single copy of data structures exists in the application virtual address space, applications can use the same virtual memory address (pointer) to access such data structures.

The Non-Uniform Accelerator Memory Access (NUAMA) architecture is proposed as an efficient hardware implementation of this programming model. This architecture incorporates mechanisms to buffer and coalesce memory requests from the CPU to the accelerator memory to reduce the performance penalty produced by long-latency memory accesses to accelerator memory.

The Asymmetric Distributed Shared Memory Model (ADSM) is also introduced in this thesis, as a run-time system that implements the proposed programming model. In ADSM, the CPU code can access all memory locations in the virtual address space, but accelerator code is restricted to accessing those memory addresses mapped to the accelerator physical memory. This asymmetry allows all required memory coherence actions to be executed by the CPU, which allows the usage of simple accelerators.

Finally, this thesis introduces the Heterogeneous Parallel Execution (HPE) model, which allows a seamless integration of accelerators in the existing sequential execution model offered by most modern operating systems. The HPE model introduces the execution mode abstraction to define the processor (i.e., CPU or accelerator) where the application code is being executed.

Multithreaded Programming and Execution Models for Reconfigurable Hardware

By Enno Lübbers ([email protected])
Advisor: Prof. Dr. Marco Platzner
University of Paderborn, Germany
March 2010

Rising logic densities together with the inclusion of dedicated processor cores have increasingly promoted reconfigurable logic devices, such as FPGAs, from traditional application areas such as glue logic, emulation, and prototyping to powerful implementation platforms for complete reconfigurable systems-on-chip.

However, traditional design techniques that view specialized hardware circuits as passive coprocessors are ill-suited for programming the combination of fast CPU cores and fine-grained reconfigurable logic within these reconfigurable computers. In particular, the programming models for software (running on an embedded operating system) and digital hardware (synthesized to an FPGA) lack commonalities, which hinders design space exploration and severely impairs the potential for code re-use.

In this thesis, we present a novel programming model and run-time infrastructure for multithreaded programming of reconfigurable logic devices called ReconOS. It is based on existing embedded operating systems and extends the multithreaded programming model, already established and highly successful in the software domain, to reconfigurable hardware. Using threads and common synchronization and communication services as an abstraction layer, ReconOS allows for the creation of portable and flexible multithreaded HW/SW applications for CPU/FPGA systems.

ReconOS uses a pervasive programming paradigm to model the interaction of interdependent executable units (threads), regardless of their execution domain (hardware or software). This common programming model has several benefits:
• First, it enables seamless run-time sharing of computational resources not only for software threads on a microprocessor, but also for hardware threads within the reconfigurable fabric. This is accomplished by utilizing partial reconfiguration together with non-preemptive and cooperative scheduling mechanisms.
• Second, it improves code reusability and portability of individual threads and the execution environment as a whole; our reference implementation is based on modern platform FPGAs and able to target different software host operating systems (e.g., eCos and Linux) and processor architectures (e.g., PowerPC and MicroBlaze).
• Third, it allows the transparent integration of dedicated hardware accelerators and established software frameworks, thereby providing access to a vast body of existing software libraries, network protocol stacks, and legacy code.

ReconOS is open-source and available at www.reconos.de.

Upcoming Events

The 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-43), 4-8 December 2010, Atlanta, Georgia, http://www.microarch.org/micro43/

The 6th Int. Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC 2011), 24-26 January 2011, Heraklion, Crete, Greece, http://www.hipeac.net/

The Asia and South Pacific Design Automation Conference 2011 (ASP-DAC 2011), 25-28 January 2011, Yokohama, Japan, http://www.aspdac.com/aspdac2011/

The 38th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POPL 2011), 26-28 January 2011, Austin, Texas, USA, http://www.cse.psu.edu/popl/11

The 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA-17), 12-16 February 2011, San Antonio, Texas, USA, http://hpca17.ac.upc.edu/

The 24th International Conference on Architecture of Computing Systems (ARCS 2011), 22-25 February 2011, Lake Como, Italy, http://conferences.dei.polimi.it/arcs2011/

The Design, Automation, and Test in Europe Conference (DATE 2011), 14-18 March 2011, Grenoble, France, http://www.date-conference.com/

The 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2011), 5-11 March 2011, Newport Beach, CA, USA, http://asplos11.cs.ucr.edu/

The 14th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing (ISORC 2011), 28-31 March 2011, Newport Beach, CA, USA, http://dream.eng.uci.edu/isorc2011/

The IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2011), 2-6 April 2011, Chamonix, France, http://www.cgo.org/cgo2011/workshops.html

The 5th ACM/IEEE International Symposium on Networks-on-Chip (NOCS 2011), 1-4 May 2011, Pittsburgh, Pennsylvania, USA, http://www.nocsymposium.org

The 32nd ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI 2011), 4-8 June 2011, San Jose, CA, USA, http://pldi11.cs.utah.edu/

The 38th International Symposium on Computer Architecture (ISCA 2011), 4-8 June 2011, San Jose, CA, USA, http://isca2011.umaine.edu/

The 48th Design Automation Conference (DAC2011), 5-10 June 2011, San Diego, CA, http://www2.dac.com/dac+2011.aspx

Contributions

If you are a HiPEAC member and would like to contribute to future HiPEAC newsletters, please contact Rainer Leupers at [email protected]

HiPEAC Info is a quarterly newsletter published by the HiPEAC Network of Excellence, funded by the 7th European Framework Programme (FP7) under contract no. IST-217068.
Website: http://www.HiPEAC.net
Subscriptions: http://www.HiPEAC.net/newsletter

HiPEAC Spring Computing Systems Week, 7-8 April 2011, Chamonix, France