HighPerformanceComputing Americas
The reference media in HPC and Big Data
MARCH 2014 - Volume I | Number 3

0x03

/success-story DEM MODELING Complex particles domesticated

/hpc_labs DISCOVERING OPENSPL Is spatial programming the future of HPC?

/state_of_the_art SHIELDS TO MAXIMUM! Simulating the impacts of orbital debris

/hpc_labs PERFORMANCE PROGRAMMING A scientific, practical methodology for measurable results (part II)

THE UBERCLOUD HPC EXPERIMENT: AN INSIDE LOOK How - and why - HPCaaS is becoming a reality

WWW.HPCMAGAZINE.COM


From the Publisher

Dear readers, when do you plan to run your computation / simulation jobs in the Cloud? The question, pushy as it may sound, has never been more timely. For this month's cover story, prepared in collaboration with the co-founders of the UberCloud HPC Experiment themselves, leaves no room for doubt. Given the level of maturity reached by the technological building blocks, given the virtually endless possibilities offered by infrastructure elasticity, given the indisputable financial benefits of on-demand resources for most non-academic sites, it is clear that the future of supercomputing is largely in the Cloud. You'll see...

Besides remote HPC and its present pros and cons, this third issue of HighPerformanceComputing covers a wide variety of topics, from an off-road example of discrete elements modeling to the simulation of orbital debris impacts on space vehicles. Space, again, is the raison d'être of OpenSPL, a new programming paradigm using spatial locality to raise performance and energy efficiency to new levels. It is presented to you by its creators. Last, those of our readers who appreciated our in-depth introduction to performance programming will be happy to find it completed this month with a section entirely dedicated to the parallelization of existing codes.

Happy reading!

Volume I, Number 3

Publisher: Frédéric Milliot
Executive Editors: Stéphane Bihan, Eric Tenin
Advisory Board: (to be announced)
Contributing Editors: Amaury de Cizancourt, Romain Dolbeau, Aaron Dubrow, Michael J. Flynn, Wolfgang Gentzsch, Stacie Loving, Oskar Mencer, Renuke Mendis, Titus Sgro, John P. Shen, Burak Yener
Communication: Laetitia Paris, [email protected]
Contacts: [email protected] / [email protected] / [email protected]
HPC Media, 11, rue du Mont Valérien, 92210 St-Cloud - France

Submissions: We welcome submissions. Articles must be original and are subject to editing for style and clarity. While every care has been taken to ensure the accuracy and reliability of the contents of this publication, no warranty, whether express or implied, is given in relation to the information reproduced in the articles or their iconography. The publishers shall not be liable for any technical, editorial, typographical or other errors or omissions. All links provided in the articles are for the convenience of readers. HPC Media accepts no liability or responsibility for the contents or availability of the linked websites, nor does the existence of a link mean that HPC Media endorses the material that appears on the sites. No material may be reproduced in any form whatsoever, in whole or in part, without the written permission of the publishers. It is assumed that all correspondence sent to HPC Media - emails, articles, photographs, drawings... - is supplied for publication or licence to third parties on a non-exclusive worldwide basis by HPC Media, unless otherwise stated in writing. All brand or product names are trademarks of their respective owners. We will always correct any copyright oversight. © HPC Media 2014.

CONTENTS THIS MONTH

/news 05 The essential

/viewpoint 19 VDI and the identity aware network

/cover_story 23 The UberCloud HPC Experiment: an inside look

/verbatim 41 Catherine Rivière, CEO, GENCI / Chair, The PRACE Council

/state_of_the_art 51 “Shields to Maximum”, Mr Scott!

/success-stories 57 From the farm to the home: DEM improves the world

/hpc_labs 62 Performance programming: an introduction (part II)

/hpc_labs 70 Discover OpenSPL, the spatial programming language

Visit www.hpcmagazine.com for multimedia content related to the features in this issue.

/news

Human brain vs von Neumann at ISC14

As Professor Karlheinz Meier, of Heidelberg University, emphasizes, the brain is characterized by its tremendous energy efficiency, fault tolerance, compact size, and ability to develop and learn. Ultimately, this is what current computing systems are required to have, an expectation that will only grow as exascale systems come closer to reality. This parallel may appear utopian, but in light of advances in the neurosciences and the particular technical features of new VLSI computing nodes, this reality is on the horizon.

In any event, this is the idea Prof. Meier intends to promote in an ISC keynote this coming June. What's more, he will highlight the differences between such a system and von Neumann's, which is based on a conventional processor-memory architecture. He will also present international projects in the field, namely the European Human Brain Project, which is attempting to build a direct link between biological data, computer simulations, and configurable hardware implementations derived from neuromorphology. Spanning a ten-year period, the Human Brain Project is being financed as part of the European Union's FP7 and Horizon 2020 plans. The objective is to build a new infrastructure for neurosciences and brain research in a medical and information technology context. Rest assured we will be there and, by proxy, you as well.

Bristol labeled Intel Parallel Computing Center (IPCC)

After the TACC, the Universities of Tennessee and Purdue, ICHEC, ZIB, and CINECA, Bristol University in Great Britain has become the seventh Intel center dedicated to parallel computing, and is now labeled an IPCC. These centers' vocation is to modernize current codes to make them more parallel-capable, so that modern processor architectures can be put to better use. They also aim to train students, scientists, and researchers in parallel programming techniques, which is a whole other challenge in itself.

Intel chose Bristol University for its research in manycore parallel architecture usage and for its leadership in standard and open models in parallel programming. This falls into the category of exploratory academic work that the chip manufacturer finances generously. Moreover, IPCCs are part of a program in which the more concrete objective is to address current challenges in parallel computing in the fields of energy, industrial science, life sciences, etc. As such, they receive funds for a two-year period which can be renewed, depending on provable results. Bristol, in particular, will be in charge of optimizing applications in the fields of fluid dynamics and molecular engineering.

NVIDIA Maxwell: in total discretion...

We're not accustomed to a quiet release of a new GPU architecture from NVIDIA. Yet this is what has happened with the launch of the GeForce GTX 750 Ti card, which introduces the Maxwell architecture, successor to Kepler and eagerly awaited by the HPC community but perhaps less so by gamers. And for good reason, as this new discrete GPU will not surpass records in terms of image resolution.

It will, however, double Kepler's energy efficiency - a significant advance for the mobile market (notebooks and tablets). A practical advantage of this efficiency is, for example, the absence of an external PCI-Express connector to power the GPU: the energy comes from the board to which it is connected, which, incidentally, will simplify the design of hosting hardware.

Dassault Systèmes + Accelrys, a chemical marriage

By launching a friendly takeover bid on Accelrys, a scientific software solution publisher based in San Diego, Dassault Systèmes' aim is to enrich its product lifecycle management solutions in the fields of molecular chemistry, from discovery to the production phase, in the life sciences industries, consumer products, high technology, and energy. The company is also acquiring a prestigious client portfolio consisting of the likes of Sanofi, DuPont, Unilever, and L'Oréal.

This majority stake will indeed allow Dassault to broaden the spectrum of scientific and technological fields targeted for the development of new products, from original idea to final design. Accelrys already provides PLM solutions with a software portfolio allowing scientists to access, organize, analyze, and share data in order to improve their productivity. The bid, approved by the Accelrys Board of Directors, is for all outstanding shares converted into cash for a valuation of $750 million.

Now, some of Maxwell's features are worth mentioning. Overall performance is increased by 35% for a 1:4 DP/SP ratio. The main innovation comes from the multiprocessors' architecture. New SMMs (streaming multiprocessors) replace Kepler's SMXes and are now cut into four smaller blocks supporting 32 CUDA cores each, or 128 on an SMM as opposed to 192 on an SMX. This granularity, combined with better workload distribution, improves the efficiency of parallelism. The occupancy is thereby doubled, provided that the use of registers and shared memory is not a limiting factor. Another notable feature is the larger L2 cache (up to 2 MB). Last, unified memory support is now at the hardware level - another important improvement over Kepler.

Lenovo's new deal

Lenovo's acquisition of IBM's x86 server division is being seen as a major disruption in the OEM server market. And indeed, based on the current Top500 figures, it's easy to realize the impact. In last November's list, the number of IBM machines accounted for about one third of the total number of systems, the same as HP. In terms of combined performance, this represents 32% of the total power, vs 16% for HP.

Now, let's take a look at x86 computers. They account for approximately 77% of the machines on the list, but only 34% of the cumulative power. BlueGenes and other amply-sized Power-based systems are indeed the ones with the most powerful configurations. The result is that by buying IBM's x86 server business, Lenovo is acquiring a choice position at the top of the chart. All other things being equal, Lenovo would come in second place behind HP, with 25% of the listed systems, but would drop to fifth place in terms of overall performance.

How things will play out in the long term remains to be seen. Although Lenovo's know-how is unquestionable, the fact that it is not American could cause it to lose some public and private contracts in the US.

PGI compilers: the 2014 collection has arrived!

The latest versions of the PGI development tools (which have been under the NVIDIA banner since last year) are revealing some interesting new features. In addition to the anticipated support for the Tesla K40 accelerator with version 2.0 of the OpenACC instruction set, PGI is also opening OpenACC up to AMD's APUs and discrete GPUs. Those who were worried about the publisher's independence after the buyout can lay their fears to rest! Another noteworthy advancement is the interoperability of the CUDA Fortran and OpenACC version with the Allinea DDT debugger. Lastly, our readers who develop on the Mac will be pleased to find a now free version of the Fortran 2003 and C99 compilers for that platform.

This 2014 crop is thus improving the portability of code to various acceleration platforms, a specialty that PGI shares with the CAPS OpenACC compiler which, for its part, also supports Intel's Xeon Phi. It also allows everyone to benefit from OpenACC 2.0's important advancements, namely global and dynamic data management. For more information on this last topic, essential to code acceleration, feel free to refer to the two in-depth articles which your favorite magazine has just devoted to the subject.
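To see what this dynamic data management buys in practice, here is a minimal sketch of our own (not PGI sample code) using the OpenACC 2.0 unstructured data directives: the device copy of an array now lives across any number of compute regions instead of being tied to a single structured block.

/* Our illustration of OpenACC 2.0 unstructured data lifetimes.
 * Build with, e.g.: pgcc -acc -Minfo=accel lifetimes.c */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1 << 20;
    float *v = malloc(n * sizeof *v);
    for (size_t i = 0; i < n; i++) v[i] = 1.0f;

    /* The device copy is created here and survives until 'exit data',
       across as many compute regions as needed. */
    #pragma acc enter data copyin(v[0:n])

    #pragma acc parallel loop present(v[0:n])
    for (size_t i = 0; i < n; i++)
        v[i] *= 2.0f;

    #pragma acc update self(v[0:n])       /* fetch results back to the host */
    #pragma acc exit data delete(v[0:n])  /* end of the device lifetime */

    printf("v[0] = %.1f\n", v[0]);        /* expect 2.0 */
    free(v);
    return 0;
}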

China will have its Power8

The mutual interest was obvious. On one end, IBM wanted to extend the realm of the OpenPower Consortium as far and wide as possible, because the future of high performance computing does not necessarily lie in x86. On the other end, the Chinese government was seeking to occupy every space where the future of high performance computing is playing out. Which is why China will have its Power8 processor, with IBM granting an industrial license for its architecture to a company created for the occasion, Suzhou PowerCore.

Similarly to what ARM does (intelligently) to unite a community of hardware and software developers, to develop an ecosystem and to ensure that its architectures live on, IBM continues to open up to businesses interested in the new iteration of Power. Until now, there haven't been very many major participants. In addition to Google, which is always in the right place at the right time, NVIDIA and Mellanox are also in the mix. But the arrival of a Chinese designer adds another dimension, especially since the agreement specifies that Suzhou PowerCore may modify the Power8 plans as it sees fit and have the resulting processors manufactured where it pleases.

As our regular readers know, the Beijing Academy of Sciences, which serves as a governing body for HPC in the Celestial Empire, is currently working on no less than six different processor architectures. Which is further evidence of the importance the Chinese attribute to high performance computing. With Power8, the objective, however, is more commercial than technical or strategic. According to Suzhou PowerCore officials, the projects targeted by these processors, aiming for the first half of 2015, are less oriented toward research than "large-scale analytics" and Big Data.

The first versions of Power8 - all developers combined - are scheduled for September of this year. IBM spokespeople have announced 12 cores running at 4 GHz with 8 threads per core, 96 MB of L3 cache on the die and 128 MB of L4 cache in the memory controllers, for a sustained memory bandwidth of 230 GB/s and an I/O bandwidth of 48 GB/s, that is, approximately twice as much as the Power7+.

Since Suzhou PowerCore is a subsidiary of C*Core, another Chinese company which has produced more than 90 million SoCs (ARM, etc.) so far, the Power8 made in China should be widely distributed. What benefits it will have over its Western competitors now remains to be seen.

HPC in 2014: the major trends according to IDC

As is customary every year at this time, IDC has just published its ten major predictions for the HPC market in 2014. Those familiar with IDC know that this document is more like an in-depth analysis of our ecosystem and its trends than a crystal ball game.

And it just so happens that these trends are occurring in an evolving landscape. It starts with the now nearly universal recognition of high performance computing as an economic and strategic value, giving the industry a clearly positive outlook for the longer term. Additionally, although Big Data is booming ($16.1 billion forecasted for 2014), its relevance for most businesses has yet to be established. Also worthy of note is the confirmed takeoff of Cloud Computing, as witnessed by the constant increase in offerings and players with a broad range of service types. Lastly, Lenovo's very recent acquisition of IBM's x86 server business is being seen as a major disruption of the dynamics in the OEM server market.

Now let's get into details. The High Performance Computing server business declined by nearly $1 billion in 2013 after three record years, but should bounce back in 2014 and return to levels just short of the 2012 figures, in volume or number of units shipped. IDC estimates that Lenovo's arrival on the scene should lead to some degree of market share redistribution. So far, IBM and HP each account for about one third of this market, with Dell in third position at about 15%. According to IDC, Lenovo should be capable of matching Dell's share.

As for supercomputers, IDC expects to see 100-PFlops systems between late 2014 and early 2015 in China, the US, Europe (PRACE), and Japan. And when it comes to exascale systems, analysts think that their roll-out in 2020 will not be impacted by any disruptive technology, and that it is now just a matter of investment. Similarly, the study is projecting a power consumption of 20 to 30 MW, but does not expect this energy efficiency to be reached before 2022-2024.

While a high-end supercomputer currently costs between $200 and $500 million - remember that price tags rarely exceeded $100 million just 3 to 4 years ago - IDC confirms that the amount could reach one billion in the next three years. Understandably, at these levels, ROIs will have to be justified! We can therefore expect an explosion of communication on achieved results, be they scientific, economic or even societal. In addition to ROI, technology transfers and competitiveness are among the main reasons why partnerships between computing centers and industry, such as PRACE and INCITE, are on the rise.

On the electronics end, IDC notes that, although coprocessors and accelerators have become mainstream (77% of high performance computing sites are so equipped versus 28% in 2011), they are still widely used for experimental purposes. So the dominance of x86 processors is not being threatened, not even close, as they still account for 80% of sales. For IDC, the main reason for this roadblock to accelerator adoption is programming difficulty, or at least the necessary additional effort. The issue of the time it takes to program or reprogram an application in anticipation of the expected performance improvement is a deterrent to migration. But, somewhat surprisingly and without providing any decisive arguments, IDC expects major leadership changes in the CPU market.

IDC has not forgotten storage and interconnects, which are essential elements in a data-saturated world. Sales derived from storage, the most dynamic segment of the HPC market, should reach record highs ($6 billion by 2017). However, with the flow of information being a difficult yet crucial issue, the interconnects market is in full transition, a transition gradually leading it away from the traditional computation-centric model.

And lastly, finishing with the Cloud, the number of sites using on-demand services for their HPC needs has grown from 13.8% in 2011 to 23.5% in 2013, making it a genuine trend. Here, too, the roadblocks to adoption are known: data security, transfer time, performance of highly parallel loads and application-defined infrastructure. But the major players in this industry have now taken these roadblocks into account, resulting in a gradual emergence of offerings and solutions intended to solve these problems. So you can bet the real figures for 2014 will show the same progression dynamic.

IDC Top 10 HPC Predictions for 2014

1. HPC server market growth will continue in 2014, after a decline in 2013
2. The global exascale race will pass the 100 PF milestone
3. High performance data analysis will enlarge its footprint in HPC
4. ROI arguments will become increasingly important for funding systems
5. Industrial partnerships will proliferate, with mixed success
6. x86-based processor dominance will grow and competition will heat up
7. Storage and interconnects will benefit as HPC architectures gradually course-correct from today's extreme compute centrism
8. More attention will be paid to the software stack
9. Cloud computing will experience steady growth
10. HPC will be used more for managing IT mega-infrastructures

"The value of an idea lies in the using of it"

The famous quote by Thomas Edison applies perfectly to the new computer at NERSC (Berkeley) bearing his name. Although the highly energy-efficient machine has a "modest" theoretical peak computing capacity of 2.4 Pflops, it was designed not only to perform large-scale simulations and modeling, but also and predominantly to process the enormous datastream they generate. "DoE sites are inundated with data that researchers cannot adequately process or analyze," explains Sudip Dosanjh, Director of the NERSC Division. "That's why Edison was optimized for these two types of computing, which require rapid management of data movements."

Indeed, Edison, a Cray XC30, was developed with ultra-fast interconnectivity, a wide memory bandwidth, a lot of memory per node, and very quick access to the file system and disk drives. In fact, it is the successor to a Cray XE6, which had become somewhat obsolete given the specificities of these tasks. As it does not include accelerators, no code needs to be rewritten, which makes the machine immediately available for simulations, a time saver clearly appreciated by the lucky scientists who will work with it.

In addition to this computational capability and immediate availability, Edison incorporates an advanced "free cooling" (ambient air cooling) technology also developed by Cray. The principle of this system, free of mechanical workings, is based on air flow between cabinets rather than through the cabinet from the door to the back. Concretely, each cabinet contains a heat sink that cools the air before it moves on to the next cabinet. According to Cray, the machine achieves a PUE of approximately 1.1, a figure that is indeed easier to reach with a passive approach.

In figures, Edison features 2.57 Pflops of peak power, 357 TB of aggregate memory, 5,576 nodes dispatched in 30 cabinets, 133,824 Intel Ivy Bridge processor cores (12 cores per compute CPU), 7.56 PB of disk storage, 462 TB/s of bandwidth for global memory, 163 GB/s of I/O bandwidth, and 11 TB/s of bidirectional network bandwidth.

Xeon E7 v2: the target is Big Data!

After three long years, this refresh of the Xeon E7 8800/4800/2800 family was highly anticipated. Based on the Ivy Bridge EX architecture, which is now manufactured with a 22 nm process, this new generation is further distinguished by 50% more cores.

With 50% more cores (for a total of 4.3 billion transistors per die), each processor can contain up to 15 cores, compared to 10 with the previous generation. Why 15 instead of 16, a logical multiple of two for binary systems? Rumor has it that a backup core is reserved for Intel's new Run Sure technology.

The main point, however, is that it was primarily designed to meet a specific application objective, i.e. the real-time processing of massive quantities of data. To that end, the memory capacity is tripled compared to the previous generation, with up to 1.5 TB of DDR3 per socket: ideal for storing large databases and eliminating costly disk access! Obviously, 6 or 12 TB in machines having 4 or 8 sockets (32 is the maximum number of sockets per machine) will allow data-intensive applications to benefit from better latencies in a NUMA environment than if they had to run on a cluster using Ethernet or InfiniBand, given the same capacity in terms of the number of memory sockets.

As for configurations, 2, 4, and 8 processors can be QPI-linked in the same system. As we go to (digital) press, some OEMs, such as HP or SGI, are designing their own controllers (and chipsets) to accommodate more processors, and a certain number of other big guys may follow. Now, regarding the thermal envelope, it appears to be remarkably well controlled, as it should not exceed 155 watts in the hottest configurations. The pricing? In the neighborhood of $6,500 per unit. For the record, Intel is claiming 80% better performance than IBM's POWER7+ for an 80% lower TCO over four years, thereby reducing the RISCs, so to speak...
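As a rough illustration of the NUMA latency point made above - ours, not Intel's - Linux's libnuma lets an application pin a large working set to the memory of the socket that will touch it, keeping accesses at local latency rather than paying an interconnect hop:

/* Hedged sketch: placing a buffer on a chosen NUMA node with libnuma.
 * Build with: gcc numa_local.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    printf("%d NUMA node(s) detected\n", numa_max_node() + 1);

    /* Allocate 1 GB on node 0: threads running on that socket see
       local latencies; remote sockets pay the QPI hop. */
    size_t sz = 1UL << 30;
    void *buf = numa_alloc_onnode(sz, 0);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }
    /* ... use buf, e.g. as an in-memory database partition ... */
    numa_free(buf, sz);
    return 0;
}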

All our articles, our columns, our interviews, our source codes and so much more...

www.hpcmagazine.com

1.4 Tbps sustained over a commercial network!

Month after month, records get broken. At the same time, it's precisely what they're made for... At ISC 2013, we reported on Alcatel-Lucent's and T-Mobile's successful experiment, which consisted in sending a sustained 400 Gbps dataflow between two climatology and industrial digital simulation applications. This was already far beyond the standards at the "time," even though the communication had been conducted under near-experimental conditions. Well, we just learned this month that Alcatel has done it again, in partnership with British Telecom, this time with a sustained flow rate of 1.4 Tbps - in itself nothing to sneer at - on an existing production network. You read correctly: neither the hardware nor the fibers had to be changed.

The achievement took place in October 2013 on a fiber-optic network connecting the BT Tower in London to the Adastral Park University campus in Suffolk, but it was not made public until this month. To disbelievers who would object that this performance is not such a big deal because of the pull of gravity along the North-South direction, let us add that the flow rate was achieved in both directions! Seriously, Alcatel said that all it had to do was reprogram its hardware platforms. It is therefore an important advancement from an industrial standpoint. For example, Huawei had demonstrated a 2 Tbps rate over the Vodafone network a few months ago, but with substantial modifications to the existing infrastructure.

Recordswise, the bar is now set at 10 Tbps. In the laboratory, Alcatel successfully sent 31 Tbps over an underwater fiber measuring 7,200 km long, with optical amplifiers every 100 km, while NEC and Verizon have reached 40.5 Tbps over 1,800 km of "conventional" fiber. Each of these two innovative companies is hinting at new announcements for SC 2014 which, let's not forget, is also the high performance communications conference...

Bull refocuses on its added value

While, for financial analysts, the objective of the new "One Bull" plan is to double the group's operational performance to 7%, for the rest of the world - namely associates and customers - the ambition of the announced strategy is to refocus Bull on its business with the highest added value. The communication people put forward the Cloud and Big Data, two catch terms in a context where Bull is clearly positioned to become the "trusted operator for business data." But the details of the plan reveal a more structure-centric reorganization hinging around the two dimensions of "data infrastructure" and "data management," with all that implies for technical team redeployment.

Indeed, One Bull, to be implemented until 2017, is based on three pillars, which can be considered logical in light of the group's expertise: high performance computing and large infrastructures, complex software integration, and data security.

As such, Bull wants to simplify its internal organization with a substantial reduction in its number of operational units, a global geographic grouping around five regional hubs (instead of the current 50 subsidiaries), and a unified sales organization. It must be stressed that this strategy does not include a layoff plan and that, at the same time, it earmarks a large proportion of revenue - 6% - for technical and growth investment via the "iBull" program. Not common currency these days.

Mellanox publishes its Ethernet Switch API

It's done. Mellanox just released its Ethernet Switch API. The Software Development Kit (SDK), with full support for Layer 2 and Layer 3 functionality, can now be downloaded from GitHub. For Mellanox, this move represents a contribution to the Open Ethernet initiative and an additional commitment to the Open Compute Project. The release is intended to enable the community to develop Open Source network applications, optimize application communication protocols and establish a solid foundation for the development of a standardized switching interface.

Now, let's put things in context. Customer needs - public or private - regarding networking are changing, especially with the emergence of SDN (software-defined networking) and the rise in power of OpenFlow. The fact that Mellanox, uniquely positioned in distributed and I/O-intensive high performance communications, is actively involved in the establishment of an open standard for Ethernet can only strengthen its legitimacy as a provider of more demanding solutions. Down the road, the winner may very well spell "InfiniBand".

Microsoft opens its cloud

It is exceptional enough to be noticed: Microsoft participating in an Open Source event. No, Windows' source code is not being released (yet) but, by becoming a member of the Open Compute Project created by Facebook three years ago, Microsoft is opening up the infrastructure design of its Windows Azure cloud.

Although a major player of the likes of Amazon and others, Microsoft probably does not want to miss the boat on this promising sector: "We do this because we want to lead innovation in the design of the Cloud and of datacenters," said Bill Laing, Corporate VP of Cloud and Enterprise, during his keynote at the Open Compute Summit in San Francisco. "And of course, we also want to learn from the community..."

For the record, the design of the servers that host the Windows Azure public cloud shows significant economies of scale. The chassis, for instance, would help save 10,000 tons of metal and 1,700 km of cables, simply by being modular. Without going into detail, each rack can be configured as a compute or storage node. They can be plugged in or removed without wiring which, for centers hosting hundreds of thousands of servers, represents a substantial time saver.

ARCHER up and running!

Like NERSC, which recently celebrated the launch of the Edison machine, the University of Edinburgh welcomes its own Cray XC30, christened ARCHER, due to replace HECToR (a Cray XE6, like Edison's predecessor, that was put in production just seven years ago).

ARCHER, short for Academic Research Computing High End Resource, is part of the resources available to the academic research of the British Crown and is connected to the UK Research Data Facility network. The machine, totaling 1.56 Pflops peak, features the latest 12-core Intel Xeon E5 v2 processors, two in each of the 3008 nodes dispatched in 16 cabinets. The CPUs are connected through the Cray Aries network interface. As at NERSC, scientists can already use the machine, simply by recompiling their applications with the provided suite of tools, namely compilers from Cray, Intel and GNU.

Who wants to win an AMD FirePro S10000 6GB accelerator? (Get lucky, there are 2 to be given away this month - again)

Fits comfortably in OpenCL developers' workstations. Two high-end AMD GPUs directly on-board for powerful dual-GPU programming. Free AMD Accelerated Parallel Processing SDK, OpenCL 1.2 compliant with BLAS and FFT libraries. OpenCL 2.0 ready! New! Free AMD CodeXL 1.3 with CPU-GPU debugger, profiler and static OpenCL code analyzer.

AMD FirePro S10000 6GB Active: 5.91 TFlops peak SP, 1.48 TFlops peak DP, 6 GB GDDR5 memory, 480 GB/s memory bandwidth, dual-slot form factor, to be used in a workstation.

Future-minded developers want a parallel computing architecture that does not restrict their ability to use open tools and to develop cross-platform. Managed by The Khronos Group, like the widely used OpenGL framework, OpenCL is the solution to this legitimate demand. OpenCL is all about easy code portability, which makes it the only future-proof path to address all HPC coding requirements.
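As a small taste of that portability - our own minimal sketch, not AMD sample code - the same host program enumerates whatever OpenCL platforms and devices happen to be installed, AMD or otherwise:

/* Minimal OpenCL host-side enumeration.
 * Build with: gcc list_cl.c -lOpenCL */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plats[8];
    cl_uint np = 0;
    clGetPlatformIDs(8, plats, &np);
    if (np > 8) np = 8;

    for (cl_uint p = 0; p < np; p++) {
        char name[256];
        clGetPlatformInfo(plats[p], CL_PLATFORM_NAME, sizeof name, name, NULL);
        printf("Platform: %s\n", name);

        cl_device_id devs[8];
        cl_uint nd = 0;
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 8, devs, &nd);
        if (nd > 8) nd = 8;
        for (cl_uint d = 0; d < nd; d++) {
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof name, name, NULL);
            printf("  Device: %s\n", name);  /* e.g., a FirePro S10000 */
        }
    }
    return 0;
}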

Last month's winners: Marcin Ziolkowski, Clemson University; Gregg Charles, GX Technology. The operation is now closed.

© 2013 Advanced Micro Devices, Inc. AMD, the AMD logo, FirePro and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple, Inc. used with permission by the Khronos Group. Other names are used for identification purposes only and may be trademarks of their respective owners. See www.amd.com/firepro for details.

/agenda

MARCH 2014

GPU Technology Conference
Where: San Jose, CA, USA
When: March 24-27
GTC is the world's biggest and most important GPU developer conference. It offers unmatched opportunities to learn how to harness the latest GPU technology, along with face-to-face interaction with industry luminaries and NVIDIA experts. Stay tuned for announcements on who'll be bringing the 'wow factor'...

28th Annual HPCC-USA Conference
Where: Newport, RI, USA
When: March 25-27
Supercomputing: What Does the Future Hold? Welcome to the next conference of the National High Performance Computing and Communications Council...

APRIL 2014

EASC2014 - 2nd Exascale Applications and Software Conference
Where: Stockholm, Sweden
When: April 2-4
This conference brings together developers and researchers involved in solving the software challenges of the exascale era. The conference focuses on applications for exascale and the associated tools, software programming models and libraries.

4th Cloud & Big Data Summit
Where: London, UK
When: April 10-11
The 4th Cloud And Big Data Summit 2014 will focus on the emerging area of cloud computing enhanced by the latest developments related to infrastructure, operations, security, governance and services available through the global network. Panels, sessions and workshops are designed to cover a range of the latest topics, trends and innovations related to cloud computing and data management. This conference is a European platform to identify your company as a key player in the dynamic, game-changing market priorities in the field of cloud computing, data and information security.

MAY 2014

2nd ASE International Conference on Big Data Science and Computing
Where: San Jose, CA, USA
When: May 27-31, 2014
The Second IEEE/ASE International Conference on Big Data Science and Computing aims to bring together academic scientists, researchers, scholars and industry partners to exchange and share their experiences and research results in Advancing Big Data Science & Engineering.

JUNE 2014

6th Annual Cloud World Forum
Where: London, UK
When: June 14-18, 2014
Cloud World Forum is EMEA's leading Cloud event with the industry's most comprehensive agenda of all things Cloud.

ISC'14
Where: Leipzig, Germany
When: June 22-26, 2014
Join the global supercomputing community: ISC'14 again brings together 2,500 system managers, researchers from academia and industry as well as developers, computational scientists and industry affiliates from over 50 countries.

To our readers
In last month's BloodHound SSC success story, we failed to mention that the interview was originally published by Renuda. We apologize for the oversight.


/viewpoints

VDI and the Identity Aware Network

Many speculated that Virtual Desktop Infrastructure (VDI) was just going to be a phase, but on the contrary, VDI has not only proven to be a contender, but has become a growing necessity for IT departments. Although it may appear that interest in VDI has gone into a downward spiral, this enabling technology, according to a certain number of analysts, is actually on the rebound.

Historically, data centers have been designed for physical environments. This means that companies today are applying yesterday's tools and capabilities to address current and future opportunities: trying to virtualize IT architectures that were not necessarily optimized for virtualization, especially in terms of the network. VDI, in particular, has and will continue to become a staple in the new data center and crop up in progressive IT departments at colleges, law firms, hotels and retail establishments. The business and technical efficiencies involved with establishing VDI are relatively simple and straightforward considering the significant improvements VDI can deliver to network manageability, security and energy efficiency.

According to The Cloud-Based Virtual Desktop Infrastructure Market 2012-2017, published in September 2012 by VisionGain, the VDI market was expected to grow to $11.2 billion by the end of 2012 and continue to increase at a compound annual growth rate (CAGR) of 14.77% through 2015. Large enterprises are drawn to VDI because of its ability to reduce desktop support and management costs, as well as lower overall energy requirements of virtual desktops. To meet compliance and security regulations, VDI also provides business continuity and disaster recovery capabilities within the data center.

VDI does, however, demand an identity approach and a more aware network. Policy and identity management are critical network security considerations as users can connect to the data center from any location, using a variety of devices. At the network level, more granular network access policies based upon user roles, device types and physical locations are required. The network then has to scale bandwidth, manage converged communications and implement network layer security policy independently from any single device or application. The access management and lack of identity features of old networks, however, won't be up to par. Before moving to VDI, data center managers must understand how network performance will be impacted while maintaining cost, delivery of multiple converged services and power efficiency.

"VDI delivers significant improvements to network manageability, security and energy efficiency."

How will VDI help increase energy efficiency?

While VDI is partially driven by the use of lighter-weight devices, such as smartphones and tablets, the network plays a key role. It helps to reduce energy consumption through the centralization of resources and by bringing much higher speeds at the port level. VDI allows higher density 10 Gigabit Ethernet (GbE) port modules on chassis-type switches, providing the advantage of collapsing all traffic easily into just a few network switches. This is all possible through the consolidation of horsepower into a single core layer as opposed to deploying distributed GbE LANs and multiple tiers, thus providing the necessary bandwidth for all VDI connections. Overall, VDI has shown itself to be more powerful and easier for IT departments to manage, while also achieving many businesses' goals of being "green" and highly efficient.


VDI has shown itself to be more powerful and easier for IT departments to manage, while also achieving the goal of being “green” and highly efficient.

How can the network converge voice, video and data for VDI deployments?

After addressing the considerations for system centralization and bandwidth, IT departments will be faced with the next issue of how to carry converged media - mixed voice, video and data. To deliver voice and video traffic to users on predetermined priorities, it is imperative that the network be capable of not only 10 Gigabit and Gigabit to the Edge, but also intelligence, Quality of Service (QoS) and ultra-low latency switching. Just like traditional networks, for users to be able to experience consistent and predictable interactions, the backbone of a VDI network must be capable of handling convergence flawlessly. Once the network has been deemed seamless and demonstrates the required quality, it is at this point that critical activities like IP phone calls and collaboration, e-learning activities using IP video and call centers can be functional.

How does the network support the security of VDI deployments?

With the VDI network, traditional operating systems are eliminated, yet user log-on, security policies, visibility and monitoring are required more than ever. Until today, many companies have relied upon the security of traditional networks that entailed complex "application layer" elements of sign-on security, including strong authentication, Single Sign-On (SSO) systems and LDAP directories. For the companies that have taken the leap toward the emergence of VDI, security, including network identity, is now simplified, centralized and managed by the network instead of a PC OS.

With the growing number of secure government facilities choosing to use advanced identity management, which is only possible via VDI, we can expect the private sector to follow by enabling similar deployments for their mobile workforces. To this end, today's businesses looking to deploy VDI securely require a new model called identity-aware networking - a term Jon Oltsik, principal analyst of Enterprise Strategy Group at Extreme Networks, defines as "a policy-based network architecture that understands and acts upon the identity and location of users and devices."

With identity-aware networking, the network gathers information from multiple existing sources in an integration process enabling IT departments and managers to use this data to build and enforce access policies. With this rich data (i.e., user, device, and location) in hand, network administrators can easily configure extremely granular network access policies that can then be enforced based upon any or all of this information and/or other factors. For example, the CFO of a company could be granted access to the end-of-quarter financials from their laptop on the corporate LAN or on a home computer connected over a VPN, but not from other devices or networks. In other cases, a contractor may be able to access engineering plans during specified hours only, as in the sketch below.
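To make the idea tangible, here is a deliberately simplified sketch - with entirely hypothetical names, not Extreme Networks code - of the kind of rule evaluation such a policy engine performs on the (role, device, location) triplet described above:

/* Hypothetical illustration of identity-aware policy evaluation. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { ROLE_CFO, ROLE_CONTRACTOR } Role;
typedef enum { DEV_MANAGED_LAPTOP, DEV_UNKNOWN } Device;
typedef enum { LOC_CORPORATE_LAN, LOC_VPN, LOC_UNTRUSTED } Location;

/* Decide access to a sensitive resource from user, device and location. */
static bool allow_access(Role r, Device d, Location l, int hour)
{
    switch (r) {
    case ROLE_CFO:
        /* Corporate LAN or VPN, but only from a managed device. */
        return d == DEV_MANAGED_LAPTOP &&
               (l == LOC_CORPORATE_LAN || l == LOC_VPN);
    case ROLE_CONTRACTOR:
        /* Contractors: on-site and during business hours only. */
        return l == LOC_CORPORATE_LAN && hour >= 9 && hour < 17;
    }
    return false;  /* default deny */
}

int main(void)
{
    printf("CFO over VPN at 22h: %s\n",
           allow_access(ROLE_CFO, DEV_MANAGED_LAPTOP, LOC_VPN, 22)
               ? "allow" : "deny");
    printf("Contractor at 22h:   %s\n",
           allow_access(ROLE_CONTRACTOR, DEV_MANAGED_LAPTOP,
                        LOC_CORPORATE_LAN, 22) ? "allow" : "deny");
    return 0;
}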


In addition to who is connecting to the network, identity-aware networking verifies what is connecting to the network...

In this situation, user and device location is important for several reasons. For starters, identity-aware networking confirms if the user is logging on from a trusted or untrusted network. This location awareness may also be important depending upon whether a user is accessing the network from a wired port or over Wi-Fi. Furthermore, depending on the device location, whether within a single facility or between two facilities on the same campus, the network access policies may change. In addition to who is connecting to the network, identity-aware networking verifies what is connecting to the network. This is important as different devices (i.e., laptops, tablets, smartphones, power meters, etc.) have different security and performance characteristics. Just as a contractor may only have access during certain times of the day, their laptop may also be treated differently than a remote employee's PC that meets all corporate security and configuration policies.

Network-based identity is associated with information including IP and MAC addresses, VLAN tags and subnets, which can all play a significant role in device authentication, VPNs and IPsec - thus network layer security takes over. Network-based identity looks at a number of specifics when securing the network, ranging from the number of inputs (user ID and the role of the user), to device characteristics and capabilities which are specific to that unique device, and user/device location.

With most deployments, IT managers and departments strive to meet daily challenges and varying mobile users and devices. To make this possible, more granular network access policies based upon user roles, device types, and physical locations will be required at the network level - a situation that can be handled by VDI. Once this has been achieved, the network then has to scale bandwidth, implement network layer security policy independently from any single device or application, and manage converged communication appropriately.

Although virtualization may have started out as a technology driven by server consolidation, today's evolving network takes it well beyond servers to a means of centralizing IT itself. The evolution starts with desktop PCs and the new class of wireless computing devices proliferating throughout the enterprise, including smartphones, tablets and PCs. In today's Internet-connected world, large organizations need their networks to enable any user to connect securely to applications and services from any authorized device.

What companies, both enterprises and SMBs, need is a "best of breed" network with the intelligence to not only enforce their policies once users are on the network, but to also dynamically collect information about the users in the network, the devices trying to connect to the network and where the users are physically when trying to connect to the VDI infrastructure. In the end, IT departments will be able to focus more on other tasks such as doing business, regulatory compliance and security ROI benefits because they will dedicate less time to maintaining and managing application-layer security.

VDI demands an identity approach and a more aware network, and only when data center managers closely examine the network's role in meeting key criteria such as cost savings, power efficiency, user and device identity, and ease-of-use can VDI truly progress towards becoming a new norm in computing.

Renuke Mendis
Principal Technical Marketing Engineer
Extreme Networks

AGILE PERFORMANCE

Kalray MPPA®-256 256 cores, 25 GFlops/Watt

World's first massively parallel, scalable, MIMD processor optimized for high performance computing and power consumption

www.kalray.eu

MPPA DEVELOPER: development workstation
MPPA ACCESSCORE: SDK - C/C++/FORTRAN, debugger and trace
MPPA BOARD: reference boards

© Copyright KALRAY SA. All rights reserved. Kalray, the KALRAY logo, MPPA, MPPA MANYCORE, MPPA DEVELOPER and MPPA ACCESSCORE are registered trademarks of KALRAY.

/cover_story

AN INSIDE LOOK AT THE UBERCLOUD HPC EXPERIMENT

Once a matter of theoretical debate, high performance computing in the cloud is now becoming an actionable reality. After reviewing (and demystifying) the issues traditionally associated with it - performance, cost, software licensing, security - let's take an insider look at the advances, challenges, and a few application use cases from the UberCloud HPC Experiment...

WOLFGANG GENTZSCH* / BURAK YENER*

Cost savings, shorter time to market, better quality, fewer product failures… The benefits that engineers and scientists could achieve from using HPC in their research, design and development processes can be huge. Nevertheless, according to two studies conducted by the US Council of Competitiveness (Reflect and Reveal) [1], only about 10% of manufacturers currently use HPC servers when designing and developing their products. The vast majority (over 90%) of companies still perform virtual prototyping or large-scale data modeling on workstations or laptops. It is therefore not surprising that many of them (57%) face application problems due to the inadequacy of their equipment; more precise geometry or physics, for instance, require much more memory than a desktop could possibly possess. Today, there are two realistic options to acquire additional HPC computing: buy a server or use a cloud solution.

* Co-founders, The UberCloud.

Many HPC vendors have developed a complete set of HPC products, solutions and services, which makes buying an HPC server no longer out of reach for an SME. Owning one's own HPC server, however, is not necessarily the best idea in terms of cost-efficiency, because the Total Cost of Ownership (TCO) is pretty high, especially considering that maintaining such a system requires additional specialized manpower.

In addition to the high costs of expertise, equipment, maintenance, software and training, buying an HPC system also often requires a long and painful internal procurement and approval process. The other option is to use a cloud solution that allows engineers and scientists to continue using their regular computer system for their daily design and development work and to "burst" the larger, more complex jobs into the HPC cloud as needed. In this way, users have access to virtually limitless HPC resources that offer higher quality results. In management and financial terms, a Cloud solution helps reduce capital expenditure (CAPEX). It offers businesses greater agility by dynamically scaling resources as needed and is only paid for when used.

Team 53 - Understanding fluid flow in microchannels with the insertion of a pillar. Four variables characterize the simulation: channel height, pillar location, pillar diameter and Reynolds number, which allows for endless combinations.

1 - What is an HPC cloud?

According to the National Institute of Standards and Technology (NIST) [3], "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." NIST further explains that the cloud model is composed of "five essential characteristics" (on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service), three "service models" (Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS)) and four "deployment models" (private cloud, community cloud, public cloud and hybrid cloud).

Standard cloud services can address a certain portion of HPC needs, notably those that don't require a lot of parallel processing, such as parameter studies with varying input and low I/O traffic requirements. However, many HPC applications cannot be shoehorned into standard cloud solutions and consequently require hardware designs that can efficiently run certain HPC workloads from various science and engineering application areas.

Many distributed computing applications are developed and optimized for specific HPC systems and require intensive communication between paral- lel tasks. To perform efficiently in an HPC cloud, these applica- tions require additional system features, such as:

• Large capacity and capability resources, application software choice and a physical or virtu- alized environment depending on performance needs. The use of high performance intercon- nects and dynamic provisioning can offer cloud features while maintaining HPC performance levels

• High performance I/O is often necessary to ensure that many I/O-heavy HPC applications will run to their fullest potential. For example, pNFS might provide a good plug-and-play interface for many of these applications. Back-end storage design, however, will be crucial in achieving acceptable performance.

• Fast network connection between the high performance cloud resources and the end-user's desktop system. Scientific and engineering simulation results often range from many Gigabytes to a few Terabytes. Additional solutions in this case are remote visualization, data compression, or even overnight express mailing of a disk with the resulting data back to the end-user (by the way, a quite secure solution).

Additionally, there may be other issues that need to be addressed before an HPC cloud can deliver low-cost and flexible HPC cycles; a careful analysis of application requirements is required in order to determine effective HPC performance in standard cloud offerings.

Team 36 - Advanced combustion modeling for Diesel engines. Simulation result showing the flame (red) located on top of the evaporating fuel spray (light blue in the center).

2 - Security in the cloud

Security is the major issue in all cloud deployments and this concerns every organization, not just SMEs. In order to protect its "crown jewels", any organization migrating to the cloud should primarily be concerned with the physical location of its data. It should also address a number of questions related to back-up and recovery, auditing, certification and, of course, security - in particular, who on the service provider's side will have access to the data. Achieving acceptable levels of security is not only a matter of technology, but also a matter of trust, visibility, control and compliance. This is achieved through effective management and operational practices. In other words, security is first and foremost a human issue, because when end-users send data and applications into the cloud, they sacrifice complete control.

However, security does not appear to be a major issue for cloud customers, as the market is currently growing by approximately 30% per year. Optimistically, there is no reason to believe that HPC in the cloud will not follow the same trend.

Team 52 - High-resolution simulations of blow-off in combustion systems. Predicted temperature contour field using OpenFOAM.

We could not agree more with Simon Aspinall [4] when he says: "as with any other new and disruptive technology, there is still some hesitation and some assumptions about how businesses should run, that have until now inhibited the speed of adoption in the enterprise market. These arguments, especially around security, are oddly similar to those that were once voiced around the Internet, online retail and even the mobile phone. Businesses not doing so already would be well advised to take a page from history and adapt or risk getting left behind by competitors already benefiting from the efficiencies a Cloud solution delivers. As with any other major decision, the key is to educate ourselves on the available options, do diligence, define our 'Cloud,' and adapt at a pace that is best for us."

3 - The HPC cloud market, providers and applications

While the general enterprise cloud services market is currently reaching 5% of the overall IT spending worldwide, HPC in the cloud is still somewhat behind. The only market data currently publicly available for HPC clouds was produced by IDC for a worldwide study involving 905 HPC sites of various sizes in government, academia and industry/commerce [5], a summary of which was presented at the ISC Cloud Conference in September 2013 in Heidelberg.

Among other interesting data, the study showed that 23.5% of these sites used cloud computing of some type in 2013, a substantial increase from 13.8% in 2011. Of the HPC sites that use cloud computing, half use private clouds and slightly more than half use public clouds (probably at an early stage). Early adopters include: government, manufacturers, bio-life science, oil and gas, digital content creation, financial services and other high-end analytics, online consumer services and health care firms. Our own observation is that the majority of these early adopters are using cloud computing mainly for exploring this new paradigm and how it might fit into their current strategy. Only a small percentage of them are actually using cloud in production.

3.A - HPC cloud adoption for SMEs

In their study, IDC did not take into account one significant market segment: digital manufacturing, in particular those only using desktop workstations. This market is mainly composed of engineers and scientists who need advanced computing technologies for simulations and production in order to improve quality, achieve faster time-to-market and reduce costs. They can be found in a multitude of areas, such as structural analysis, aerodynamics, fluid flows, crashworthiness, environment testing, stress testing, process engineering and manufacturability. As previously noted, they primarily use workstations, but according to Jon Peddie [6], the worldwide workstation market is currently worth $7 billion in terms of revenue, with around 3.5 million units shipped last year. That's easily in the neighborhood of 20 million users, mainly using workstations and PCs for their daily design and development simulations, making them potential candidates for HPC clouds, let alone those manufacturers who don't use computing at all (sometimes called the 'missing middle') or even the knowledgeable amateur who could do a lot with sufficient computing power, in the context of 3D printing for example.

A confirmation of this potential can be found in an Intersect360/NCMS study on Modeling and Simulation at 260 U.S. Manufacturers [7], which we believe is still valid today. This study shows that in 2010, 61% of companies with over 10,000 employees were using HPC, whereas only 8% of companies with fewer than 100 employees were.

Meanwhile, at the same time, 72% of desktop CAE users saw a competitive advantage in adopting more advanced computation. Obviously, there are a few compelling roadblocks towards adopting HPC, especially in the mid-market. As previously mentioned, these roadblocks are principally the TCO and the cumbersome procurement processes. Clearly, because of the low penetration of HPC beyond the desktop in the general manufacturing industry, the use of HPC in the cloud is still in its infancy.

Team 30 - Heat Transfer Use Case. 2D room with a cold air inlet on the roof, a warm section on the floor and an outlet on a lateral wall near the floor. The picture shows the velocity and temperature fields at a given time of the transient simulation.

Team 54 - Analysis of a pool in a desalination plant with complex air-water free-surface modeling.

And yet, there are rising hopes that this situation will change. First and foremost, because the technical press regularly addresses the various roadblocks and helps raise awareness of the many benefits of in-house HPC and HPC as a Service (HPCaaS) in the SME market of the manufacturing industry. Secondly, because there are several important trends and initiatives that foster HPC in the cloud for this same market. Thirdly, because there is a growing acceptance of general enterprise cloud computing solutions such as ERPs or CRMs, which will eventually contribute to the adoption of HPC-related services for engineering simulations on demand.

Additionally, large manufacturing companies expect their supply chain partners to perform more and more high-quality end-to-end simulations in less time on HPC systems. And, last but not least, there are American and international initiatives like the ‘Missing Middle’, the National Center for Manufacturing Sciences and the UberCloud HPC Experiment that help promote the concept of HPC in the cloud.

3.B - A growing number of HPC cloud service providers

Over the past five years, hundreds of cloud services providers (CSPs) have entered the market, offering hardware resources, software and expertise. To name just a few resource providers: Amazon AWS, CloudSigma, Fujitsu TC Cloud, GOMPUTE, GreenButton, HP, MEGWARE, Microsoft Azure, Nimbix, OCF, Oxalya/OVH, Penguin Computing, Rescale, Sabalcore, Serviware/Bull, SGI, SICOS, TotalCAE and transtec (more providers here).

While commercial CSPs evolve, a growing number of supercomputing centers are becoming interested in offering access to their HPC clusters and expertise to local and/or regional industry players on a pay-per-use basis. Examples include Georgia Tech, NCSA, the Ohio Supercomputing Center, Rutgers University and the San Diego Supercomputer Center in the US, the CESGA Supercomputing Centre and FCSCL in Spain, CILEA in Italy, GRNET in Greece, HSR in Switzerland, Monash University in Australia, SARA in The Netherlands and the Mesocentres initiative in France.

In the late 90s, many digital manufacturing ISVs tried to get into the Application Service Provider business by offering access to their solutions via the web. Most of them failed because of severe (or rather insurmountable) technical or business roadblocks that couldn’t be overcome at that time.

While many of the barriers have since been resolved or at least softened, many of these ISVs remain concerned that the cloud computing paradigm could potentially disrupt their existing business model based on selling annual node-locked or floating licenses. The majority of them still believe that customers might turn to on-demand, cloud-based, pay-per-use software services instead of buying and using software on their own systems.

While this apprehension is understandable, the reality may be totally the opposite. In fact, engineers who turn to the cloud because of the above-mentioned benefits are very likely to also continue using their workstations for day-to-day standard R&D scenarios.

The cloud offers increased opportunities for engineers to do more, faster and better simulations (and products) - see the example in [8]. For these scenarios, ISVs will sell on-demand, pay-per-use, hourly rated or subscription tokens, resulting in additional business as well as workstation license revenues. Even for those companies which already have an HPC cluster in their computing center, bursting into a cloud offers higher efficiency and flexibility for the engineering team and for the computing center, which will more likely result in increased on-demand license business. Not to mention that ISVs offering on-demand application software will be able to attract new customers who are just beginning to work with computer simulations and who would never have considered buying a license for just a few simulations.

3.C - HPC cloud software licensing

HPC cloud software licensing is still considered one of the major barriers to the wider adoption of cloud computing. This is particularly true for SME manufacturers, as shown by a poll during an UberCloud webinar in June 2013: the ISVs’ slow adoption of more flexible (on-demand) licensing models for the cloud was the major concern for 61% of the respondents.

Team 26 - Development of stents for a narrowed artery.

Team 56 - Simulating the performance of an axial fan in a duct. Typical results (speedup over number of cores) of a simulation on a workstation (12 cores, green dot), an in-house HPC cluster (16 and 24 cores, blue line) and an HPC cloud (12, 24 and 48 cores, red line).

At the same time, things are a-changing. On the one hand, the very same ISVs that have been traditionally reluctant to provide on-demand licenses are now seriously considering them (ANSYS, SIMULIA...) or are already offering SaaS services (Autodesk Sim360, CD-adapco’s Power on Demand...). On the other hand, a large number of major software providers - especially the ones involved in digital manufacturing or HPC tools with cloud features - are currently participating in the UberCloud HPC Experiment. Among them: Acellera, Adaptive Computing, Advanced Cluster Systems, Advection Technologies, AMPS Technologies, ANSYS, Artes Calculi, Autodesk, BGI, BlackDog Endeavors, Bright Computing, CAELinux, CEI, Certara, CHAM, Ciespace, CloudioSphere, Cloudsoft, Cloudyn, eXact, CPUsage, Cycle Computing, Datadvance, ELEKS, Equalis, ESI Group, ESTECO, Expert Engineering Solutions, Fidesys, Flow Science, Foldyne Research, Friendship Systems, Gompute, GPU Systems, HCL Infosystems, HPC Solves, HPC Sphere, Kitware, Kuava, Landmark (Halliburton), MapR Technologies, micrOcost, migenius, MSC Software, Nice Software, Nimbus Informatics, Numerate, Open Source Research Institute, Ozen Engineering, Personal Peptides, Phenosystems, PlayPinion, Qtility Software, QuantConnect, Rescale, RMC Software, SimScale, SIMULIA, Stillwater Supercomputing, TECIC, TotalSim, TYCHO, Univa and Visual Solutions...

3.D - Application performance in the cloud

The HPC application spectrum is large. On one end, there are the so-called massively (or ‘pleasingly’) parallel applications like parameter studies in digital manufacturing and drug design. In these cases, the code runs many instances in parallel on many cores, with each parameter job running on one core; or, more globally, applications run on many servers, with each parameter job running on separate servers (with moderate parallelism on the servers’ cores). These applications are well-suited for standard enterprise cloud servers.

On the other end of the spectrum, parallel (distributed) applications have tightly-coupled communication needs among parallel tasks and/or high scalability requirements that would run perfectly on specialized HPC servers. A third class of codes has mid-size, mid-scale parallel requirements that can easily run on one server’s parallel cores.

Three components are necessary for a cloud datacenter to provide a single point of access to this kind of heterogeneous HPC cloud composed of enterprise and HPC servers: a user-friendly portal providing access to the cloud, an intelligent resource manager scheduling the different application jobs to the appropriate systems within the HPC cloud, and computing resources, ideally preloaded with the requested codes.
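To make that last point concrete, here is a minimal sketch of the routing decision such a resource manager would have to make for the three application classes just described. It is our own illustration with hypothetical job attributes and pool names, not a description of any actual UberCloud or vendor component:

    # Hypothetical sketch: routing the three HPC job classes to server pools.
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        tasks: int             # number of parallel tasks requested
        tightly_coupled: bool  # MPI-style communication between tasks?

    def route(job: Job, cores_per_server: int = 16) -> str:
        # Pleasingly parallel work (parameter studies): independent tasks,
        # fine on standard enterprise cloud servers.
        if not job.tightly_coupled:
            return "enterprise-servers"
        # Mid-size parallel jobs: fit on the cores of a single server.
        if job.tasks <= cores_per_server:
            return "single-server"
        # Tightly coupled, large jobs: need specialized HPC servers
        # with a low-latency interconnect.
        return "hpc-servers"

    for j in (Job("parameter-study", 1000, False),
              Job("mid-scale-CFD", 12, True),
              Job("large-FEM", 256, True)):
        print(f"{j.name} -> {route(j)}")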

4 - The cost model for in-house vs in-cloud high performance computing

According to an IDC study [2], only 7% of the total cost associated with acquiring and operating an HPC system comes from the hardware. A much larger portion comes from the high cost of expertise (staffing), equipment, maintenance, software and training required to run such systems, which explains the high TCO.

4.A - The cost of an HPC system

To estimate what it really costs to own and operate an HPC system, let’s look at a realistic scenario. Let’s assume that a company needs to run a variety of simulation jobs, most of them mobilizing 32 cores (for example, 10 million finite elements) and some of them, with finer-grained geometry, mobilizing 256 cores (e.g. 100 million finite elements). The company has to buy a 16-node system, each node having 16 cores, to be able to perform the 32-core as well as the 256-core runs.

Let’s say the price of this kind of system would amount to $70K, or 7% of the TCO. According to IDC, the Total Cost of Ownership of such a system is $1,000K over three years, or $330K per year for 256 cores, or $1,289 per core per year, or $0.15 for one core per hour (core hour). Googling “average server utilization” returns average utilization rates between 5% and 20% (we know some are higher, especially in dedicated scientific supercomputer centers). The table below shows the total cost of one core hour for this 16-node (256-core) in-house HPC cluster depending on its utilization rate (or “number of busy nodes”).

Busy nodes               1      2      3      4      5      6      8      12     16
Utilization (%)          6.3    12.5   18.8   25.0   31.3   37.5   50.0   75.0   100
Cost of 1 core hour ($)  2.36   1.19   0.79   0.59   0.47   0.40   0.30   0.19   0.15

Table 1 - Cost of one core hour on a 16-node (256-core) in-house cluster depending on its utilization rate. Source: IDC.

4.B - The cost of HPC in the cloud

To get to the real cost of HPC in the cloud, let’s consider the cost of one core per hour, which amounts roughly to $0.20 (this is a typical price at AWS for example, excluding additional services and application software of course). Thus, the workload of a 20% utilized cluster is equivalent to 256 cores * 24 hours * 365 days * 20% = 448,512 core hours; at $0.20 per core hour, that comes to about $90K all in all.

4.C - The cost of a hybrid solution

Let’s go back to our scenario of a company that needs to run a mix of 32-core and 256-core simulation jobs at an average 20% system utilization. Let’s also assume the 32-core workload utilizes a small in-house HPC cluster with two 16-core nodes, to accommodate all of the 32-core jobs. This cluster is just 12.5% of the size of the big 256-core cluster with its $1,000K three-year TCO, i.e. $42K per year. With an excellent cluster utilization of, say, 92%, the core hour amounts to $0.16 on this 2-node in-house cluster. If the remaining 256-core (16-node) jobs run in the cloud for about one month per year (we have to choose one month to achieve the above 20% utilization rate of the full in-house cluster), the core hour is again $0.20, or $37K for the month.

The combined expenditure of in-house and in-cloud costs ($42K and $37K, respectively) results in $79K per year for the hybrid solution, compared with $90K for a full HPC-in-the-cloud service and $330K for the in-house cluster. If the decision is a matter of cost and nothing else, the hybrid and the cloud solutions beat the in-house HPC cluster by at least a factor of 3 at a ‘standard’ 20% usage rate or less. According to the table above, the cost of the in-cloud solution ($0.20 per core hour) exceeds that of the in-house solution only above an average 75% utilization rate of the latter - a figure typical of academic supercomputing centers serving hundreds or thousands of users, but not of average private organizations.
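The arithmetic above is easy to verify. The following short Python sketch (our own check, using only the inputs quoted in this section) reproduces Table 1 to within a cent of rounding, as well as the three scenario totals:

    # Checking the section 4 cost model (all inputs taken from the text).
    TCO_YEAR = 1_000_000 / 3       # $1,000K TCO over three years
    NODES, CORES = 16, 16          # 16 nodes x 16 cores = 256 cores
    HOURS = 24 * 365
    CLOUD = 0.20                   # typical cloud price per core hour

    # Table 1: in-house cost of one core hour vs. number of busy nodes.
    for busy in (1, 2, 3, 4, 5, 6, 8, 12, 16):
        cost = TCO_YEAR / (busy * CORES * HOURS)
        print(f"{busy:2d} busy nodes ({100 * busy / NODES:5.1f}%): ${cost:.2f} per core hour")

    in_house = TCO_YEAR                                   # ~$330K per year
    cloud = NODES * CORES * HOURS * 0.20 * CLOUD          # 20% utilization, ~$90K
    hybrid = 0.125 * TCO_YEAR + NODES * CORES * 24 * 30 * CLOUD  # 2 nodes + 1 cloud month
    print(f"in-house ${in_house / 1e3:.0f}K | cloud ${cloud / 1e3:.0f}K | "
          f"hybrid ${hybrid / 1e3:.0f}K per year")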

5 - The UberCloud HPC Experiment accelerates HPC in the cloud

In theory, Cloud Computing and its emerging technologies such as virtualization, web access platforms and their integrated toolboxes, solution stacks accessible on demand, automatic cloud bursting capabilities, etc., enable researchers and industries to use additional computing resources in a flexible and affordable on-demand way. Now, the UberCloud HPC Experiment provides a platform for researchers and engineers to explore, learn and understand the end-to-end process of accessing and using HPC clouds, to identify the issues and resolve the roadblocks (more in [9]). The principle is that end-users, software / resource providers and HPC experts collaborate in teams to jointly solve the end-user’s application problems in the cloud.

5.A - The UberCloud HPC Experiment History

Inspired by the results of the Magellan Report [12], the UberCloud HPC Experiment idea arose in May 2012 while we were comparing the acceptance of cloud computing for typical enterprise applications in contrast to High Performance Computing.

While the adoption of cloud computing in the enterprise market is rapidly growing (41.3% per year through 2016, according to Gartner [13]), the awareness and adoption of cloud computing in HPC and digital manufacturing is still very slow. This is mainly due to obstacles such as inflexible software licensing, slow data transfer, security of data and applications and a lack of specific architectural features (resulting in reduced performance in the cloud). The idea of the UberCloud HPC Experiment was therefore to find out more about the end-to-end process of bringing engineering applications to the cloud and, in doing so, to learn more about the real difficulties and the best ways to overcome them. The Experiment started in July 2012.

Team 34 - CFD simulation of a vertical wind turbine with 3 helical rotors.

Since July 2012, the UberCloud HPC Experiment has attracted 1,100+ organizations from over 66 countries (as of February 2014). To date, the organizers have been able to build 125 of these teams, in CFD, FEM, Computational Biology and other prominent HPC domains, and to publish more than 60 articles about the UberCloud initiative, including numerous case studies about the different applications and lessons learned. Recently, UberCloud TechTalk and a virtual Exhibition [10] have been added, along with a Compendium that includes 25 case studies from digital manufacturing in the Cloud [11].

Team 14 - Electric field distribution at 900MHz for 1W excited power displayed on model surface and on a cutplane. The anatomical human voxel model (“Duke”) was realistically posed on a car seat in front of a dipole antenna representing a mobile device with hands-free equipment, discretized with 112 million mesh cells. Simulation performed with the transient solver of CST MICROWAVE STUDIO based on the Finite Integration Technique (FIT).

5.B - A practical approach

The technology components of HPCaaS that enable remote access to centralized resources in a multi-tenant way and their metered use are not unfamiliar to the HPC research and engineering community. However, as service-based delivery models take off, users have mostly been on the fence, observing and discussing the potential hurdles to their adoption. What is fairly certain is that we now have the technology ingredients to make it a reality. Let’s start by defining what role each stakeholder (industrial end-users, resource providers, software providers and high performance computing experts) has to play to make service-based HPC come together:

The industry end-user - A typical example is a small or medium size manufacturer in the process of designing, prototyping and developing its next-generation product. These users are prime candidates for HPC-as-a-Service when in-house computation on workstations becomes too lengthy and acquiring additional computing power in the form of an HPC server is too cumbersome or not in line with IT budgets. Plus, HPC is not likely to be the core expertise of this group.

The application software provider - This includes software owners of all shapes and sizes, including ISVs, public domain software organizations and individual developers. The UberCloud Experiment usually prefers rock-solid software, which has the potential to be used on a wider scale. For the purpose of this experiment, on-demand license usage is tracked in order to determine the feasibility of using the service model as a revenue stream.

The resource provider - This pertains to anyone who owns HPC resources (such as computers and storage) and is networked to the outside world in an open way. A classic HPC center would fall into this category, as would a standard datacenter used to handle batch jobs or a cluster-owning commercial entity that would be willing to provide compute cycles to run non-competitive workloads during periods of low CPU utilization.

Team 46 - CAE simulation of water flow around a ship hull. The picture shows the pressure distribution on the hull surface.

The HPC experts - This group includes individuals and companies with HPC expertise, especially in the areas of cluster management and software porting. It also encompasses PhD-level domain specialists with in-depth application knowledge. In the Experiment, these experts, acting as team leaders, work with end-users, computer centers and software providers to help the pieces of the puzzle fit together.

For example, suppose the user is in need of additional resources to increase the quality of a product design or to speed up a product design cycle - say for simulating sophisticated geometries or physics or for running copious amounts of simulations for a higher quality result. This suggests a certain software stack, domain expertise and even hardware configuration. The general idea is to look at the end-user’s tasks and software and select the appropriate resources and expertise that match specific requirements.

Then, with modest guidance from the UberCloud Experiment team, the user, resource provider and HPC expert will implement and run the task and deliver the results back to the end-user. The hardware and software providers will measure resource usage; the HPC expert will summarize the steps of analysis and implementation; the end-user will evaluate the quality of the process and the results, in addition to the degree of user-friendliness this process provides. The experiment orchestrators will then analyze the feedback received. Finally, the team will get together, extract lessons learned and present further recommendations as input to their case study.

5.C - An experiment in progress

For the 125 teams from 66 countries (with over 1,100 active and passive organizations involved), the end-to-end process of taking applications to the cloud, performing the computations and bringing the resulting data back to the end-user has been partitioned into 23 individual steps which the teams closely follow on the Basecamp collaboration environment.

An UberCloud University (“TechTalk”) has been created, providing regular educational lectures for the community. And the one-stop UberCloud Exhibit [10] offers an HPC Services catalog where community members can exhibit their cloud-related services or select the services that they want to use for their team experiment or for their daily work. Besides, many UberCloud HPC Experiment teams publish their results widely (for example, the article by Sam Zakrzewski and Wim Slagter from ANSYS entitled “On Cloud Nine” [14]). Finally, at the November 2013 Supercomputing Conference in Denver, the UberCloud received the HPCwire Readers’ Choice Award for the best HPC cloud implementation [15].

6 - HPC cloud Use Cases from the UberCloud HPC Experiment

As a glimpse into the wealth of practical use case results so far, we chose to present six out of the 125 UberCloud Experiment results, demonstrating the wide spectrum of CAE applications in the cloud (more in the Teams pages of the UberCloud’s website).

Team 1: Heavy-duty structural analysis using HPC in the cloud

This first team consisted of engineer Frank Ding from Simpson Strong-Tie, software provider Matt Dunbar with Abaqus software from SIMULIA, resource provider Steve Hebert from Nimbix and team expert Sharan Kalwani, at the time of this experiment with Intel. The functions of this team range from solving anchorage tensile capacity and steel and wood connector load capacity to special moment frame cyclic pushover analysis.

Team 1 - Structural analysis model using HPC in the cloud. The picture shows a concrete anchor bolt tension capacity simulation with 1.9 million degrees of freedom.

The HPC cluster at Simpson Strong-Tie is modest (32 cores), so when emergencies arise, the need for cloud bursting is critical. Another challenge is the ability to handle sudden large data transfers, as well as the need to perform visualization to ensure that the design simulation is proceeding along the correct lines.

Team 2: Simulating new probe design for a medical device.

The end user’s corporation is one of the world’s leading analytical instrumentation companies. They use computer-aided engineering for virtual prototyping and design optimization of sensors and antenna systems used in medical imaging devices. Other participants in this team were software provider Felix Wolfheimer from Computer Simulation Technology (CST) and team expert Chris Dagdigian from The BioTeam, Inc. The team used Amazon AWS cloud resources.

Team 2 - Simulation model of part of the probe to equip a new generation of medical device.

From time to time, the end-user needs large compute capacity in order to simulate and refine potential product changes and improvements. The periodic nature of the computing requirements makes it difficult to justify the capital expenditure for complex assets that will likely end up sitting idle for long periods of time. To date, the company has invested in a modest amount of internal computing capacity sufficient to meet base requirements. Additional computing resources would allow the end user to greatly expand the sensitivity of current simulations and may enable new product and design initiatives previously written off as “untestable“.

The hybrid cloud-bursting architecture permits local computing resources residing at the end-user site to be utilized along with Amazon cloud-based resources. The project explored the scaling limits of the Amazon EC2 instances, with scaling runs designed to test computing task distribution via the Message Passing Interface (MPI). The use of MPI allows the leveraging of different EC2 instance type configurations.

The team also tested the use of the Amazon EC2 Spot Market, in which cloud-based assets can be obtained from an auction-like marketplace offering significant cost savings over traditional on-demand hourly prices.
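For readers who have not used MPI, the fragment below illustrates the basic task-distribution pattern such scaling runs exercise: rank 0 scatters work items to all ranks, each rank computes locally, and the partial results are gathered back. This is our own minimal sketch using the mpi4py bindings, not the team’s actual code:

    # Minimal MPI task-distribution sketch (illustrative only).
    # Run with, e.g.: mpirun -n 4 python scatter_demo.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Rank 0 prepares one chunk of work per rank; other ranks pass None.
    chunks = [list(range(i, 100, size)) for i in range(size)] if rank == 0 else None

    work = comm.scatter(chunks, root=0)    # distribute the chunks
    partial = sum(x * x for x in work)     # each rank computes its share
    totals = comm.gather(partial, root=0)  # collect the partial sums

    if rank == 0:
        print("sum of squares 0..99 =", sum(totals))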

Team 8: Flash dryer simulation with hot gas used to evaporate water from a solid

This team consisted of Sam Zakrzewski from FLSmidth, Wim Slagter from ANSYS as software provider, Marc Levrier from Serviware/Bull as resource provider and HPC expert Ingo Seipp from Science + Computing. In this project, Computational Fluid Dynamics (CFD) multiphase flow models were used to simulate a flash dryer using CFD tools that are part of the end-user’s extensive CAE portfolio. On the in-house server, the flow model took about five days for a realistic particle-loading scenario. ANSYS CFX 14 was used as the solver. Simulations for this problem used 1.4 million cells, five species and a time step of 1 millisecond for a total time of 2 seconds. A cloud solution allowed the end-user to run the models faster and increase the turnover of sensitivity analyses. It also allowed the end-user to focus on engineering aspects instead of spending valuable time on IT and infrastructure problems.

Team 40: Simulation of Spatial Hearing

This team consisted of an anonymous engineer end-user from a manufacturer of consumer products, software providers Antti Vanne, Kimmo Tuppurainen and Tomi Huttunen from Kuava, and HPC experts Ville Pulkki and Marko Hiipakka from Aalto University in Finland. The resource provider was Amazon AWS.

(Above) Team 8 - Flash dryer model viewed with ANSYS CFD-Post.

(Below) Team 40 - Dots in the simulation model indicate all locations of monopole sound sources used in the simulations. Red dots are sound sources. The middle section shows the sound pressure level (SPL) in the left ear as a function of the sound direction and the frequency. On the right, the SPL relative to sound sources in the far field is shown.

The human perception of sound is a personal experience. Spatial hearing (i.e. the capability to distinguish the direction of sound) depends on the individual shape of the torso, head and pinna (the so-called head-related transfer function, or HRTF). To produce directional sounds via headphones, one needs to use HRTF filters that “model” sound propagation in the vicinity of the ear. These filters can be generated using computer simulations, but to date, the computational challenges of simulating HRTFs have been enormous. This project investigated the fast generation of HRTFs using simulations in the cloud. The simulation method relied on an extremely fast boundary element solver.
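To make the role of these filters concrete: once a left/right pair of head-related impulse responses (HRIRs) has been obtained for a given direction (whether measured or, as in this project, simulated), spatializing a mono signal reduces to two convolutions. Below is a minimal NumPy sketch of ours, with random placeholder data standing in for real HRIRs:

    import numpy as np

    def spatialize(mono, hrir_left, hrir_right):
        """Render a mono signal binaurally by convolving it with the
        left- and right-ear head-related impulse responses."""
        left = np.convolve(mono, hrir_left)
        right = np.convolve(mono, hrir_right)
        return np.stack([left, right], axis=-1)  # (samples, 2) stereo array

    # Placeholder data; real HRIRs would come from measurement or from
    # a boundary-element simulation like the one described above.
    rng = np.random.default_rng(0)
    stereo = spatialize(rng.standard_normal(48_000),  # 1 s of audio at 48 kHz
                        rng.standard_normal(256),     # left-ear HRIR
                        rng.standard_normal(256))     # right-ear HRIR
    print(stereo.shape)                               # (48255, 2)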

Team 58 - Simulating wind tunnel flow around bicycle and rider.

Team 58: Simulating wind tunnel flow around bicycle and rider

The team consisted of end-user Mio Suzuki from Trek Bicycle, software provider and HPC expert Mihai Pruna from CADNexus and resource provider Kevin Van Workum from Sabalcore Computing. The CAPRI to OpenFOAM Connector and the Sabalcore HPC Computing Cloud infrastructure were used to analyze the airflow around bicycle design iterations from Trek Bicycle.

The goal was to establish a greater synergy among iterative CAD design, CFD analysis and HPC cloud. Automating iterative design changes in CAD models coupled with CFD significantly enhanced the engineers’ productivity and enabled them to make better decisions. Using a cloud-based solution to meet the HPC requirements of computationally intensive applications decreased the turnaround time in iterative design scenarios and significantly reduced the overall cost of the design.

Team 118: Conjugate heat transfer for the design of jet engines in the cloud

This team consisted of the end user, thermal analyst Hubert Dengg from Rolls-Royce Germany, the resource providers Thomas Gropp and Alexander Heine from CPU 24/7, software provider ANSYS, Inc., represented by Wim Slagter, and HPC/CAE expert Marius Swoboda from Rolls-Royce in Germany.

The aim of this HPC experiment was to link the commercial CFD code ANSYS Fluent with an in-house finite element (FE) code. This conjugate heat transfer process is very demanding in terms of computing power, especially when 3D-CFD models with more than 10 million cells are required. Consequently, it was thought that using cloud resources would have a beneficial effect on computing time.

Two main challenges had to be addressed: bringing software from two different providers with two different licensing models together, and getting the Fluent process to run on several machines when called from the FE software. After a few adjustments, the coupling procedure worked as expected.

From the end user’s point of view, there were multiple advantages to using the external cluster resources, especially when the in-house computing resources were already at their limit. Running bigger models in the cloud (in a shorter time) gave more precise insights into the physical behavior of the system. The end user also benefited from the HPC cloud provider’s knowledge of how to set up a cluster, run applications in parallel based on MPI, create a host file, handle the FlexNet licenses and prepare everything needed for turn-key access to the cluster.
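The coupling the team describes follows the standard partitioned pattern for conjugate heat transfer: the two solvers run separately and exchange interface data every iteration until the wall state stops changing. The outline below is our own generic sketch with stand-in solver callables; it is not the actual Rolls-Royce/ANSYS coupling code:

    # Generic partitioned conjugate heat transfer loop (illustrative).
    def couple(cfd_flux, fe_temperature, t_wall, tol=1e-3, max_iter=50):
        """Fixed-point iteration between two solvers:
        cfd_flux(t_wall)       -> wall heat flux for a given wall temperature
        fe_temperature(q_wall) -> wall temperature for a given heat flux
        """
        for it in range(max_iter):
            q_wall = cfd_flux(t_wall)        # CFD step (e.g. the flow solver)
            t_new = fe_temperature(q_wall)   # FE step (the structural code)
            change = max(abs(a - b) for a, b in zip(t_new, t_wall))
            t_wall = t_new
            if change < tol:                 # interface state has converged
                return t_wall, it + 1
        raise RuntimeError("coupling did not converge")

    # Toy demo with linear stand-in "solvers".
    t_final, iters = couple(lambda t: [0.5 * x for x in t],
                            lambda q: [290.0 + 0.1 * x for x in q],
                            [300.0, 310.0])
    print(t_final, iters)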

Team 118 - Contours of Heat Flux.

REFERENCES

[1] Council of Competitiveness. ‘Make’, ‘Reflect’, ‘Reveal’ and ‘Compete’ studies, 2010.

[2] Randy Perry and Al Gillen. Demonstrating business value: Selling to your C-level executives. IDC White Paper, April 2007.

[3] NIST Cloud Definition

[4] Simon Aspinall. Security in the Cloud. Strategic News Service, 16-39, October 28, 2013.

[5] Steve Conway et al. IDC HPC End-User Special Study of Cloud Use in Technical Computing. June 2013.

[6] Jon Peddie. Workstations continue solid growth in Q3’13.

[7] Intersect360/NCMS study on Modeling and Simulation at 260 U.S. Manufacturers, July 2010.

[8] Wolfgang Gentzsch. Why Are CAE Software Vendors Moving to the HPC Cloud?

[9] Wolfgang Gentzsch and Burak Yenier. The Uber-Cloud Experiment. HPCwire, June 28, 2012.

[10] The UberCloud Services Exhibit.

[11] The UberCloud HPC Experiment: Compendium of Case Studies. Intel, June 25, 2013.

[12] Katherine Yelick, Susan Coghlan, Brent Draney, Richard Shane Canon. The Magellan Report on Cloud Computing for Science, Dec. 2011.

[13] Gartner Predicts Infrastructure Services Will Accelerate Cloud Computing Growth. Forbes, Feb. 2013.

[14] Sam Zakrzewski and Wim Slagter. On Cloud Nine. Digital Manufacturing Report. April 22, 2013.

[15] The UberCloud Receives Top Honors in 2013 HPCwire Readers’ Choice Award, November 27, 2013.

/verbatim

CATHERINE RIVIERE PRESIDENT & CEO, GENCI CHAIRWOMAN, PRACE COUNCIL

Interview by STEPHANE BIHAN

On the agenda:
• The foundations of HPC in Europe
• Promoting simulation for SMEs’ competitiveness
• Is Europe to be a global leader or follower in supercomputing?
• PRACE at Horizon 2020

Mrs. RIVIERE, as CEO of GENCI and Chairwoman of the PRACE Council, could you briefly present these two organizations and tell us specifically how they interact to develop high-performance computing in France and Europe?

GENCI (Grand Equipement National du Calcul Intensif, or National High Performance Computing Infrastructure) was created in 2007 by the French authorities following a 2005 report pointing out that France was lagging behind in providing computing capacity to our scientists, with the result that they were leaving to work abroad.

HPC is a strategic tool to maintain the competitiveness of a country. As such, the lack of resources presents a real problem for academics and, in the long term, for industry.


Twenty European countries have decided to join forces in order to implement a distributed computing infrastructure.

This is why it was necessary to put in place a genuine policy in this sector. The administration therefore decided to create a specific organization, GENCI, to place France at the highest level in Europe and internationally.

GENCI has three main missions: the first is to lead and support the national strategy for the provision of computing resources for scientific research. The second consists of representing France in any European governmental initiative, i.e. any initiative that requires national representation. Finally, GENCI’s third mission is to promote simulation and HPC for research in both the academic and industrial sectors. GENCI is a non-trading company held by the Ministry of Research and Higher Education (at 49%), by CEA and CNRS at 20% each, by the CPU (Conference of University Presidents) at 10%, and finally by INRIA at 1%.

At the European level, PRACE (Partnership for Advanced Computing in Europe) did not yet exist when GENCI was founded, but its creation had already begun. Not only was it mentioned in the report on French policy on scientific computing, but the first Scientific Case for PRACE in 2005 had already proposed a roadmap for high performance computing at the European level. This document stated, inter alia, that at the time, no single country could compete with the United States or Japan. China was also on the list, but not at the level where it is today.

It was therefore essential that Europe mobilize. This was realized at both the national and European levels. France had a real problem of scientific competitiveness, and knowing that no member of the Union could compete with the major nations, we needed to unite. In 2008, the European Commission, which was of course anxious that Europe should play its role in high-performance computing, launched a preparatory phase project, PRACE, under the FP7 program, to develop an infrastructure for research at the European level. This initial preparatory phase project was followed in 2010 by the implementation phases PRACE-1IP, 2IP and 3IP, the current phase.

In budgetary terms, GENCI represents €30 million per year and PRACE €530 million over five years until 2015. These €530 million include €400 million from the four machine hosting countries - France, Germany, Spain and Italy - on one hand and, on the other hand, €130 million: €60 million from the Commission and €70 million from the 25 other representatives.

What is the total power of the computing infrastructure available to the European scientific community?

At the national level, 1.6 Pflops are distributed among three computing centers: the TGCC center at the CEA, the CINES center in Montpellier, and the IDRIS center at Orsay. Since 2008, GENCI has updated the equipment at these three centers to reach the 1.6 Pflops in question: 280 Tflops at CINES, the first GENCI installation in 2008; more recently, IDRIS with approximately 1 Pflops; and finally, the CURIE machine, which accounts for 2 Pflops. However, 80% of the latter are used for the French commitment to PRACE, the remaining 20% being counted as national computing resources.

MareNostrum in Barcelona (BSC).

At the European level, four countries provide 15 Pflops, corresponding to a total of €100 million for each of them. France provides the CURIE machine (CEA); Spain provides the MareNostrum machine (BSC) for 1 Pflops; Italy, the Fermi computer (Cineca) for 2 Pflops; and last, Germany provides access to three machines: Hermit in Stuttgart for 1 Pflops, JUQUEEN in Jülich for 6 Pflops and SuperMUC in Garching for 3.2 Pflops.

How is access to these resources organized?

Access to these resources is open to scientists and industry users based on the sole criterion of scientific excellence, assessed by a committee, the Access Committee, consisting of renowned scientists. PRACE offers three types of access: the first, called "preparatory" access, is intended to conduct tests for the second type of access, called "classic," which extends over one year on the basis of computing hours granted. Finally, the last type of access is called "multiyear" and concerns large projects on a European scale.

Can you describe the PRACE infrastructure selection and construction process for us?

At GENCI, we issue traditional competitive calls for tender based on our users’ needs. As far as PRACE is concerned, the machines belong to the partners, and the partners are therefore free to choose the purchasing process for their machines. However, the goal is still to develop a European infrastructure consisting of different machines in order to meet all of the (very different) users’ needs.

So PRACE doesn’t finance the purchase of machines itself?

No. France, Spain, Italy and Germany, the four countries who were the first to bet on European computing, finance the purchase of these machines with their own funds.

Is what we are talking about here only providing resources, or is there also ongoing follow-up and support of scientific projects? Where appropriate, can you describe how user support is organized?

PRACE also meets its users’ needs with a service and support environment, such as assistance in porting code. Together with the 25 other partner countries, we set up six training centers whose purpose is to grow the computing ecosystem around PRACE supercomputers.


In order for Europe to remain competitive, PRACE has to become a sustainable infrastructure, which is not quite the case just yet.

HydrOcean, a start-up backed by GENCI through the HPC-SME initiative, received the prestigious IDC HPC Innovation Excellence Award for scientific excellence in fluid flow simulation. In concrete terms, what was this support?

This is a project that I am very attached to. HPC and digital simulation are essential tools to increase competitiveness, and so we needed to create a plan specifically for SMEs - the HPC-SME initiative. Promoting the use of simulation and HPC by industry falls squarely within GENCI’s mission.

The objective is to help these SMEs ask themselves whether using simulation will increase their competitiveness. How can we help them to answer that question? With a pragmatic approach whose sole aim is to meet each SME’s needs - not to sell them an innovative method. This implies listening and finding the most suitable solution for them, knowing that the final decision to use the resources is theirs. This initiative, born of the combined efforts of GENCI and INRIA, was immediately supported by BPI France, which allowed us to launch the program as early as 2010.

So specifically, it consists of both financial and technical support?

You know, simply giving an SME the opportunity to compute on a machine is of no great interest, especially if it does not know a priori what that means. On the other hand, a proactive approach to the question allows us to ask SMEs whether digital simulation can help them and, if so, how, and only then to show them how. It is a proof of concept, a real demonstration, with custom assistance. Thanks to their sound understanding of SMEs, competitiveness clusters, particularly those who include digital simulation in their approach, have naturally been involved.

The next step was to accompany SMEs with experts who are familiar both with the world of digital simulation and the business world, i.e. the EPICs such as IFP Energies Nouvelles and ONERA, who supported us right away, and the CNRS. Overall, one must understand this strategy as a combination of competitiveness clusters, which disseminate and reach out at the regional level, and institutions that provide expertise in addition to GENCI’s own network of experts. All of this exists in order to address all of the problems facing SMEs - most of them being unfamiliar with high performance computing.

However, the icing on the cake was the launch in 2010, as part of Future Investments, of the Equipex Equip@meso project to make simulation and HPC a vector of scientific and technological development. Under GENCI’s leadership, Equip@meso was to have coordinated investments on the order of ten "mesocenters" (i.e. tier-2 regional supercomputing centers). However, by becoming coordinator of equipment that did not fall within its scope of action, GENCI’s role was in some sense limited to that of banker.

We therefore proposed to Equip@meso that we help them win the Equipex tender by participating in the scientific leadership of the network, by disseminating regionally the best practices that we had acquired at the national and European levels and, above all, by asking them to expand the HPC-SME initiative locally. In this way we succeeded in developing this initiative by relying on regional mesocenters.

SuperMUC in Garching (LRZ).

Finally, increasing the competitiveness of our companies truly consists of spreading the good word about digital simulation and HPC among them in order to lead them there from a practical and technological perspective. SMEs have trusted this approach because it is open and GENCI has remained neutral: we open the door to possibilities, but in the end, they are still in control of their own decisions.

And so, in addition to HydrOcean mentioned earlier, could you give us another example of success resulting from this approach?

Since 2010, we have supported no fewer than 43 SMEs, with some outstanding success stories like HydrOcean: thanks to parallelization and access to national computing resources, it increased its software performance 500%. By working five times faster, it can therefore potentially earn five times more contracts. This really is efficiency, not human effort. It should be noted that HydrOcean is the first French SME to have benefited from 13 million hours of computing on a German PRACE machine. Which, by the way, allowed it to win a major contract across the Rhine.

To give you another example, I would like to mention Nexio Simulation, a specialist in aeronautics electromagnetism based in Toulouse. Thanks to the support of HPC-SME and the CALMIP mesocenter, a partner of Equip@meso, it has just gotten two major contracts in Japan. I would like to emphasize that we are talking about custom support and not a transfer of R&D. If it had not received free access to major computing resources, it would probably have been unable to respond to the call for tenders in Japan.

The HPC-SME initiative has attracted considerable interest among many European projects. We have therefore included it in the PRACE-3IP implementation phase in order to deploy it at the Union level; it will lead to a certain number of synergies without which nothing ambitious will be possible.

With the ETP4HPC in Europe and the HPC and Simulation call for proposals in France, European and French authorities are finally realizing the strategic role of HPC in innovation and competitiveness. In your opinion, what has led to this awareness?

It comes simply from the comparison with the United States, which invests millions of dollars in this sector each year. But it also results from the vigorous mobilization of the scientific community in France and Europe to alert the public authorities that our competitiveness, our innovative energy and our international expansion are all falling behind. It took this huge communication effort to raise our leaders’ awareness of the strategic role of digital simulation. We should also emphasize that from now on, high performance computing will be indispensable in many sectors. This is the case in health care, for example, where legislation prohibiting animal testing is a real catalyst for the simulation of living creatures and the creation of new molecules.

In this regard, how well do you think the national and European programs complement one another? In what ways aren’t they simply competing with one another?

In Europe, the commonly used term is "subsidiarity." What can’t be achieved at the national level is achieved at the European level. For the record, the greatest allocation of hours in the world was made by PRACE on the German machine Hermit for an English team from the Met Office: no less than 144 million hours were granted to enable this team to gain three years’ lead in the development of high resolution climate models. Clearly, even our American friends had not made such an allocation. Without PRACE, this project would not have been possible. Computing cycles are allocated to projects that are on a larger scale than those that we can undertake at the national level. PRACE has a Scientific Council, an independent body that judges projects on the basis of their scientific excellence, and that made the bet to allocate its computing hours to important projects simply because the request was exceptional. This is how Europe draws its partner countries upwards and onwards, by making scientific investments that are more easily achievable at the European level than at the national level.

Let’s take a minute to focus on our governing bodies. What are the ambitions of high performance computing policies? And do you think the allocated resources are adequate with respect to these ambitions?

At the European level, Europe does have a genuine political will. This is even the first time that such a will has been displayed in the HPC area. The Commission wants to have a Europe that is competitive on the international stage, and we would hope that the Horizon 2020 program will yield the results hoped for, even if, considering the €15 million allocated in the 2014-2015 Work Programme, the financial resources are not realistically adequate today.

However, all is not settled, and it is important that the PRACE model become a sustainable European infrastructure, which is not quite the case just yet. The financial commitment for PRACE will run until 2015 - that is to say, tomorrow. However, we are building PRACE’s future today, and we will need a different form of financial support, because four countries alone will not be able to support access to HPC resources for all the countries of the Union. Other sources of funding will have to be found, whether they come from the European Commission or other partners, to support both the investment in the machines and their operating costs.

At the French level, when GENCI was founded in 2007, the initial budget was €27 million, and the plan was to double it over the next three years. Today it is still €30 million. We are of course fully aware of the difficulties related to the current economic context; they underline the absolute necessity to unite at the European level.

PRACE’s role as a pillar of the development of HPC in Europe being now fully recognized, how is it positioned with respect to the ETP4HPC?


PRACE will offer 50 Pflop/s machines by 2020, with complementary architectures to allow for new HPC uses...

The policy of the European Commission concerning HPC is clear. It is a policy based on three pillars: a pillar of research infrastructure, an R&D pillar to build technologies and a third pillar concerned with applications. ETP4HPC is the technological pillar, PRACE the infrastructure pillar, and the Centers of Excellence, still under construction, the applications pillar.

PRACE has just articulated its strategic vision through the year 2020. Could you present its main points and specific objectives for us?

As we just stated, PRACE has guaranteed funding until mid-2015. We have now defined the next phase with the aim of meeting the needs of users. Written by the Scientific Steering Committee on the basis of the Scientific Case that identifies these needs (among others), the PRACE Council agreed to aim for an overall computing power on the order of 50 Pflops by 2020, with complementary architectures. We also want to develop different uses for the machines and to facilitate code porting, a practice that did not really exist in the previous PRACE model.

On the other hand, the Council is working today on the financial model for PRACE’s future which, for its part, is still under discussion - and for which there should be greater funding. The objective is to have a clear vision by the end of the year.

Do you have a doubt about the availability of these funds?

Caution is needed, but it’s now clear that PRACE is a success. However, based on the TCO of a 1 Pflops machine, the cost of 50 Pflops of computing power is not of the same order of magnitude. Given these facts, we must come up with a new financial model for PRACE.

What are the challenges that Europe must now face to improve - or even maintain - its positioning in high performance computing at the international level?

The European high performance computing effort is a process that begins with ambitions, talent, and rather exceptional skills in the areas of use and software. Setting up such a process is time-consuming, and now the challenge is to make it thrive and develop. HPC is a global race in which historic players such as the United States and Japan are well placed, but also in which new emerging countries such as China, Russia and India are poised to catch up, or have already caught up, with the leaders.

With respect to PRACE, thanks to the JUQUEEN machine, Europe has now reached 7th place in the Top 500 ranking. In terms of performance, 80% of the European public infrastructures are in PRACE. Europe is following the same trajectory in the increase in computing power, but at a level substantially below that of the United States. Europe must act as a "continent" and not follow a national approach, and it needs additional financing to reach that higher level.

In the race for the Exascale, Europe may appear "fragmented" compared to the United States or China, where computing centers are already focusing on achieving this objective by the end of the decade. Through its Horizon 2020 program, does Europe aim to take the lead, or should it resign itself to following its competitors?

To the question "Is Europe a follower?" I will answer "no," if only because the largest allocation in the world was made not in the United States or in China but in Europe.

Curie at Bruyères le Châtel (TGCC - CEA).

We are among the front-runners, with many global advances achieved on European machines, such as the DEUS astrophysics project to model the structuring of the universe, carried out on the CURIE machine. From this perspective, Europe is not a follower.

Computing power is not really what’s at stake here. What is at stake is the use that is made of it. Creating exaflop machines is a very interesting project, but how many teams will be able to use them? The challenge is really to develop uses, to train more people, and to give young people the urge to pursue technologies that people dream about.

PRACE must be seen as the actual demonstration of the will of Europe’s leadership, or at least of its intent to place in this race to the Exascale, most likely with all the relative slowness inherent in our political and administrative organization, and with the fragmentation of our computing centers. Suffice it to say that the first discussions concerning PRACE, which was created in 2010, date back to 2004-2005. However, European high performance computing is up and running, and we are proud of this venture, which has already led to some quite exceptional results. That is why the model begun and followed until now in PRACE must be perpetuated in PRACE 3, including a more integrated structure.

Along the same lines, in your opinion, what are Europe’s strengths and weaknesses when it comes to HPC?

Europe has had very good scientists, know-how and skills in the field of computing for quite a few years now. These are considerable strengths. However, funding is not at the same level as that allocated in the United States and China. That said, funding is not the essential factor: we must also build Europe to provide greater integration, efficiency and responsiveness.

PRACE collaborates with XSEDE, its equivalent in some sense in the United States. Could you tell us more about this collaboration? How is it implemented? What are the objectives and expected results?

PRACE and XSEDE launched this initial collaboration focusing on development and tools testing in order to evaluate and enable interoperability between our two research infrastructures. A first call for projects was launched in late 2013, in response to which we received some very attractive proposals that we are currently evaluating.

Does PRACE intend to set up the same types of collaboration with China or other countries that are major international players?

It is of course necessary to talk with our Chinese and Japanese counterparts. For the time being, these talks are still quite informal. For example, we and the Japanese are currently discussing ways in which we could work together and the scope of our collaboration.

Fermi in Bologna (Cineca).

But remember that PRACE is still a relatively young organization that has yet to determine its processes and guidelines in this area. Being represented by twenty-five countries, four of which have greater voting rights, PRACE must obtain a consensus among its partners on the proper policy to pursue with China. Collaboration projects are underway, but we are moving forward carefully. In particular, we must make sure that we are all at the same level.

Isn’t this relatively excessive caution with respect to the Orient regrettable?

China has a very clear line of action with a specific direction. They develop machines with their own processor technologies for applications they consider strategic. They have ambitions and the means to achieve them, both financial and intellectual. This excessive caution you are talking about is a political problem. What is the European policy towards China in HPC - and in other sectors? If HPC is considered strategic in many industrial fields, it does necessarily call for a bit of caution...

In conclusion, two years after the installation of the CURIE machine, what will be the next "megasystem" placed at the disposal of the scientific community? What will its features be, and when will it be available?

At the national level, GENCI is renewing the computational capabilities of the CINES with a public tender that is underway. At the European level, the power of the SuperMUC machine will be doubled in early 2015, from 3.2 to 6.4 Pflops. The second German machine, Hermit, a CRAY XE6, will evolve into a CRAY XC30 4 Pflops system in late 2014 or early 2015. Regarding Spain’s MareNostrum, its configuration will also evolve in 2015. As for Curie 2, whose launch we are targeting for 2017, it is a little early for all of the technical details to be finalized.

But you know, given the international context where supercomputers become obsolete so quickly, we have to do everything we can to promote high performance computing as a major component of Europe’s scientific, economic and social strategy. This is the philosophy of our commitment, both at the national and continental levels.


/state_of_the_art

SHIELDS TO MAXIMUM, Mr Scott! Researchers use TACC supercomputers to perform simulations of orbital debris impacts on spacecraft and fragment impacts on body armor. These simulations are evaluated at up to 10 kilometers per second to ensure that they accurately capture the dynamics of hypervelocity impacts and match real-world experiments conducted by NASA. AARON DUBROW

We know it’s out there, debris from 50 years of space exploration — aluminum, steel, nylon, even liquid sodium from Russian satellites — orbiting around the Earth and posing a danger to manned and unmanned spacecraft. According to NASA, there are more than 21,000 pieces of “space junk” roughly the size of a baseball (larger than 4 inches or 10 centimeters) in orbit, and about 500,000 pieces that are golf-ball-sized (between 0.4 inch and 4 inches, or 1 to 10 centimeters). Sure, space is big, but when a piece of space junk strikes a spacecraft, the collision occurs at a velocity of 3.1 to 9.3 miles (5 to 15 kilometers) per second — roughly ten times faster than a speeding bullet!

“If a spacecraft is hit by orbital debris, it may damage the thermal protection system,” warns Eric Fahrenthold, professor of mechanical engineering at The University of Texas at Austin, who studies impact dynamics both experimentally and through numerical simulations. “Even if the impact is not on the main heat shield, it may still adversely affect the spacecraft. The thermal researchers take the results of impact research and assess the effect of a certain impact, crater depth and volume on the survivability of a spacecraft during reentry.”

Only some of the collisions that may occur in low earth orbit can be reproduced in the laboratory. To determine the potential impact of fast-moving orbital debris on spacecraft — and to assist NASA in the design of shielding that can withstand hypervelocity impacts — Fahrenthold and his team developed a numerical algorithm that simulates the shock physics of orbital debris particles striking the layers of Kevlar, metal, and fiberglass that make up a space vehicle's outer defenses.

Supercomputers enable researchers to investigate physical phenomena that cannot be duplicated in the laboratory, either because they are too large, small, dangerous — or in this case, too fast — to reproduce with current testing technology. Running hundreds of simulations on the Ranger, Lonestar and Stampede machines at the Texas Advanced Computing Center, Fahrenthold and his students have assisted NASA in the development of ballistic limit curves that predict whether a shield will be perforated when hit by a projectile of a given size and speed. NASA uses ballistic limit curves in the design and risk analysis of current and future spacecraft.

Fig. 1 - This high speed video frame depicts the perforation of a six-layer Kevlar target (4 inches/10.1 cm in width) by a 0.44 caliber copper projectile (experiment performed at Southwest Research Institute).

Fig. 2 - View of an orbital debris hole made in the panel of the SolarMax satellite. Credit: NASA, Orbital Debris Program Office.

Results from some of his group's impact dynamics research were presented at the April 2013 meeting of the American Institute of Aeronautics and Astronautics (AIAA), and have recently been published in Smart Materials and Structures and the International Journal for Numerical Methods in Engineering.

Fig. 3 - This simulation models the perforation of a six-layer harness satin weave Kevlar target (4 in/10.1 cm in width) by a 0.44 caliber copper projectile.

In the paper presented at the AIAA conference, they showed in detail how different characteristics of a hypervelocity collision, such as the speed, impact angle and size of the debris, could affect the depth of the cavity produced in ceramic tile thermal protection systems.

The development of these models is not just a shot in the dark. Fahrenthold's simulations have been tested exhaustively against real-world experiments conducted by NASA, which uses light gas guns to launch centimeter-size projectiles at speeds up to 10 kilometers per second. The simulations are evaluated in this speed regime to ensure that they accurately capture the dynamics of hypervelocity impacts.

Validated simulation methods can then be used to estimate impact damage at velocities outside the experimental range, and also to investigate detailed physics that may be difficult to capture using flash x-ray images of experiments.

The simulation framework that Fahrenthold and his team developed employs a hybrid modeling approach that captures both the fragmentation of the projectiles — their tendency to break into small shards that need to be caught — and the shock response of the target, which is subjected to severe thermal and mechanical loads.

"We validate our method in the velocity regime where experiments can be performed, then we run simulations at higher velocities, to estimate what we think will happen at higher velocities," Fahrenthold explains. "There are certain things you can do in simulation and certain things you can do in experiment. When they work together, that's a big advantage for the design engineer."

Fig. 4 - This figure (provided by NASA Johnson Space Center) shows an example of a spacecraft shielding system, which provides both thermal insulation and orbital debris protection in low earth orbit. The back (0.08 in/2 mm) aluminum layer is the wall of the spacecraft; in front (separated by a standoff) are shielding layers composed of aluminum, fiberglass, Teflon, and Mylar materials.

Fig. 5 - Solid rocket motor (SRM) slag. Aluminum oxide slag is a byproduct of SRMs. Orbital SRMs used to boost satellites into higher orbits are potentially a significant source of centimeter-sized orbital debris. This piece was recovered from a test firing of a Shuttle solid rocket booster.

Back on land, Fahrenthold and graduate student Moss Shimek extended this hybrid method in order to study the impact of projectiles on body armor materials, in research supported by the Office of Naval Research. The numerical technique originally developed to study impacts on spacecraft worked well for a completely different application at lower velocities, in part because some of the same materials used on spacecraft for orbital debris protection, such as Kevlar, are also used in body armor.

According to Fahrenthold, this method offers a fundamentally new way of simulating fabric impacts, which have been modeled with conventional finite element methods for more than 20 years. The model parameters used in the simulation, such as the material's strength, flexibility, and thermal properties, are provided by experimentalists. The supercomputer simulations then replicate the physics of projectile impact and yarn fracture, and capture the complex interaction of the multiple layers of a fabric protection system — some fragments getting caught in the mesh of yarns, others breaking through the layers and perforating the barrier.

"Using a hybrid technique for fabric modeling works well," Fahrenthold says. "When the fabric barrier is hit at very high velocities, as in spacecraft shielding, it's a shock-type impact and the thermal properties are important as well as the mechanical ones."

Moss Shimek's dissertation research added a new wrinkle to the fabric model by representing the various weaves used in the manufacture of Kevlar and ultra-high molecular weight polyethylene (another leading protective material) barriers, including harness-satin, basket, and twill weaves. Each weave type has advantages and disadvantages when used in body armor designed to protect military and police personnel. Layering the different weaves, many believe, can provide improved protection.

Fig. 6 - This figure shows a computational model of the spacecraft shielding and wall depicted above. The wall of the spacecraft is shown in red. The shielding is modeled as five layers: aluminum (in yellow), multilayer insulation (in green), and Beta-cloth plus aluminized Mylar (in blue). The model reproduces exactly the correct material areal densities, but the layer count is reduced in order to reduce computational cost. Simulation results show the impact of a 0.094 in/0.24 cm diameter aluminum projectile on the spacecraft shield and wall. Impact velocity is 5.59 miles per second (9 km/s) and the impact obliquity is 30 degrees. The simulation indicates that this projectile will pass through the shielding system and perforate the wall of the spacecraft.

Fahrenthold and Shimek (currently a post-doctoral research associate at Los Alamos National Laboratory) explored the performance of various weave types using both experiments and simulations. In the November 2012 issue of the AIAA Journal, they showed that in some cases the weave type of the fabric material can greatly influence fabric barrier performance. According to Shimek, "Currently, body armor normally uses the plain weave, but research has shown that different weaves that are more flexible might be better, for example in extremity protection." Shimek and Fahrenthold used the same numerical method employed for the NASA simulations to model a series of experiments on layered Kevlar materials, showing that their simulation results were within 15 percent of the experimental outcomes.

"Future body armor designs may vary the weave type through a Kevlar stack," Shimek said. "Maybe one weave type is better at dealing with small fragments, while others perform better for larger fragments. Our results suggest that you can use simulation to assist the designer in developing a fragment barrier which can capitalize on those differences."

What can researchers learn about the layer-to-layer impact response of a fabric barrier through simulation? Can body armor be improved by varying the weave type of the many layers in a typical fabric barrier? Can simulation assist the design engineer in developing orbital debris shields that better protect spacecraft? The range of engineering questions is endless, and computer simulations can play an important role in the 'faster, better, cheaper' development of improved impact protection systems. "We are trying to make fundamental improvements in numerical algorithms, and validate those algorithms against experiment," Fahrenthold concludes. "This can provide improved tools for engineering design, and allow simulation-based research to contribute in areas where experiments are very difficult to do or very expensive." 


/success-stories

From the farm to the home: DEM improves the world

Discrete Element Modeling (DEM) using Lagrangian Multiphase models has been applied to a multitude of industrial applications. With recent advances in the simulation of discrete particles and computing power, it is now being applied to the off-road vehicle sector.

TITUS SGRO*

Image © TractorByNet.com

Powerful computing hardware has allowed automotive companies to expand upon their design simulations, reaching out from the "core" optimization areas of external aerodynamics and engine design to improve upon and perfect virtually every aspect of an automobile. One particular group of vehicles that has lagged behind traditionally but is now racing to catch up is off-road vehicles. This category includes, in addition to ATVs, massive industrial applications like farming equipment and construction vehicles, as well as household items such as lawn mowers and snow blowers. Recent technological advances in the simulation world allow this machinery to be modeled using discrete elements to replicate small particles, bypassing approximations and guesswork and going straight to a high-fidelity examination of the true operating conditions of these vehicles - exactly like modern cars.

While fluid flow has been a centerpiece of Computer Aided Engineering (CAE) and Computational Fluid Dynamics (CFD) for decades, being able to predict the motion of individual particles has been an elusive utopia. The automotive sector has been the most demanding of these kinds of applications, seeking to understand and improve the complex processes going on within agricultural, construction and domestic vehicles.

* Technical Marketing Engineer, CD-adapco

Agricultural and construction vehicles (tractors, lawn mowers, front loaders, etc.) that are used to pick up and move enormous amounts of tiny particles can be affected by airflow and motor mechanics, among many other physical mechanics. With recent advances in the simulation of discrete particles and computing power, engineers are finally able to use CAE to improve and optimize these vehicles by modeling each particle individually to create a much more faithful simulation of these complex machines.

Contaminant particles such as dust, as well as particulates that these objects are designed to work with, have previously been simulated in various methods as some form of fluid continuum. These particles include blades of grass (for a lawn mower), gravel and dirt (for construction equipment), and grain (for farming equipment). Simulating these particles as a fluid continuum creates imperfect results due to the simplification.

Fig. 1 - Simulation of airflow velocities within the mower deck.

Fig. 2 - Underside view of volume mesh of lawn mower highlighting mowing deck and blades.

Applications of DEM

DEM using Lagrangian Multiphase has already been applied to a myriad of applications, from icing on airplane wings to mud and dirt simulation. Now this powerful technology is being applied to the off-road vehicle sector, including:

• Construction vehicles: dump trucks, excavators, and loaders

• Agricultural equipment: tractors and harvesters

• Domestic machinery: lawn mowers, snow blowers, and all-terrain vehicles

• Other areas of the Ground Transportation Industry: rain, mud, or snow modeling

A host of other industries have already begun using DEM, to simulate spray particles or atomizers – very common in the food industry, painting with aerosol cans or industrial applications for the automotive and

aerospace industries – as well as the Life Sciences industry, investigating the dispersal of medication in a patient's blood from a pill or in the lungs from an inhaler.

Fig. 3 - Underside view of grass particles in motion within the lawn mower deck.

Two-blade tractor lawn mower

The following example uses a dual-bladed tractor lawn mower, with the blades positioned under the driver, as seen in Fig. 2. A plate protects the driver from the grass being propelled at them. The blades were set in their own region with a separate mesh and spun using CD-adapco STAR-CCM+'s Rigid Body Motion (RBM), which rotates the entire region around a set axis. Because of the limitations of RBM, CD-adapco recommends that during each time-step of the rotation, the spinning region should be turned by 1 degree at most so that the rotating cells do not become completely out of line with the stationary cells. In the simulation pictured, the two-meter-long blades are spinning at 1,500 rpm, which translates into a time-step of 1.11×10⁻⁴ s. The simulation was set at 20 inner iterations per time-step to ensure convergence. The particle injector field consisted of a 15×15 rectangular grid of injectors and each injector was set to inject 100 particles per second. The simulation ran for a bit over 0.5 s of simulation time, meaning that almost 13,000 particles were injected.
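As a quick sanity check of those numbers (our back-of-the-envelope sketch, not part of the original study), the 1-degree-per-step rule fixes the time-step directly from the rotation speed, and the injector settings explain the particle count:

    #include <stdio.h>

    /* Illustrative check of the solver settings quoted above: at
     * 1,500 rpm, limiting the rotating mesh to 1 degree of rotation
     * per time-step yields the 1.11e-4 s step, and the 15x15 injector
     * grid at 100 particles/s per injector gives roughly 13,000
     * particles after ~0.58 s of simulated time. */
    int main(void) {
        double rpm = 1500.0;
        double deg_per_second = rpm / 60.0 * 360.0;  /* 9,000 deg/s    */
        double dt = 1.0 / deg_per_second;            /* 1 deg per step */
        double injectors = 15.0 * 15.0;              /* 225 injectors  */
        double rate = 100.0;                         /* particles/s    */
        double t = 0.58;                             /* simulated time */
        printf("time-step: %.3e s\n", dt);           /* 1.111e-04 s    */
        printf("particles: %.0f\n", injectors * rate * t); /* ~13,050  */
        return 0;
    }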

Lagrangian multiphase

The Lagrangian-Eulerian approach is based on "a statistical description of the dispersed phase in terms of a stochastic point process that is coupled with an Eulerian statistical representation of the carrier fluid phase." In other words, the main fluid phase is solved as a continuum using the time-averaged Navier-Stokes equations. The second, discretized phase involves either a second fluid dispersed in relatively small pockets, or small particulates of a solid that are mixed in with the liquid.

The Lagrangian Multiphase model makes use of DEM particles and simulates each particle individually as it is affected by the fluid domain and by collisions with other particles and features within the simulation boundaries. This is the most faithful representation of particles possible within a fluid simulation, allowing for the examination of primary and secondary breakup models, as well as turbulent dispersion. However, as one can imagine, it also requires a lot of processor power, especially when the number of particles exceeds 10⁵.

Fig. 4 - Grass particles in motion within the mower deck.

Certain assumptions and approximations were used within the simulation. The most important assumption was that the tractor lawn mower did not actually move within the simulation. The wheel rotation, translation and creation of a real grass field for the lawn mower to drive over being beyond the reasonable scope of the simulation, the movement was simulated by having the grass particles constantly injected into the cutting areas.

Since the full tractor body was not included in the simulation, it did not need to be meshed at all, allowing for finer refinement of the blades, protector plate, and air volume in the area of question. Another approximation was that the grass external to the plate was not physically present to impede the motion of the moving particles. Hence, an invisible barrier was set up around the perimeter of the plate to the ground to prevent the leakage of particles in areas besides the ejector. This barrier was set as a porous region, allowing the fluid (air) to flow freely but not allowing the particles to cross the barrier.

The third major approximation was the shape of the grass particles. Every grass particle was modeled as a composite of spherical particles in a 1 cm long conical shape. The composite particle was forbidden from being broken. To keep it simple and inexpensive, only one particle model was used. Additionally, the particles were injected into the region already separated from the ground; the lawn mower blades were not involved in separating them from the injector in any way and only provide air motion.

Because of the small time-step (caused by the 1 degree rotation mentioned above), a detailed picture of the motion of the particles was created. This enabled the visualization of areas where clumping occurred, as well as areas that were affected by internal vortices. Every particle was permitted to exit only through the ejector (in all of the figures, the ejector is positioned on the left-hand side). The particles were modeled with drag, and their density was set to a value such that each particle would be primarily affected by the motion caused by the spinning blades.

As with the previous case study, the particles were colored with respect to their velocity magnitude. The benefit of this simulation was demonstrating that a complex physical simulation of a lawn mower can be accurately modeled, allowing designers to see the movement of each individual blade of grass. This permits the designers to modify their design based on complexities too subtle to be seen with the naked eye, modifications which can significantly increase efficiency by reducing fuel consumption and the time required to mow. 

/hpc_labs

AN INTRODUCTION TO

PERFORMANCE PROGRAMMING [ part II ]

While it is always difficult to optimize existing code, an appropriate methodology can provide provable and replicable results. Last month, we saw that choosing the right test cases is essential to validate and evaluate optimizations; we then concentrated on using the compiler efficiently to optimize and vectorize. This month, let's see how data layouts, algorithm implementations and parallelization impact performance...

ROMAIN DOLBEAU*

As previously mentioned, a minimal understanding of the hardware is required to exploit it. Efficiently utilizing the computational resources of the CPU is important, but since the term "memory wall" was coined by Wulf and McKee [1], it has been known that memory accesses are the limiting factor for many kinds of codes. The expression "bandwidth-bound algorithm" is dreaded by many of those tasked with improving performance, as it means that the available bandwidth from the main memory to the CPU cores is the limiting factor. But even without changing the algorithm, it is sometimes possible to drastically improve performance.

I - DATA LAYOUTS

One of the main issues with memory bandwidth is the amount wasted by poor data layouts. This issue is not new, and was already targeted by papers such as the work of Truong et al. [2].

* HPC Fellow, CAPS Entreprise

But it still seems to be largely unknown to most developers. We first have to go back to computer architecture (and Hennessy and Patterson [3]). The main technique a CPU uses to avoid the hundreds of cycles required for data to arrive from main memory is the cache: a tiny but fast memory close to the execution cores. While the existence of caches is well known, it seems that most programmers are unaware of their behavior and write code that greatly impedes their efficiency.

I.A - Arrays of structures vs. structures of arrays

Caches do not retain data at single-element granularity, e.g. a single double-precision value. While this would help with temporal locality (the fact that a recently used element might soon be reused), it would do nothing for spatial locality (the fact that the next useful element is likely to be a memory neighbor of a recently used element). To gain such spatial locality, caches have a granularity of a cache line, a contiguous amount of memory. Common sizes are 32 or 64 bytes of well-aligned memory (i.e. the address of the first byte is an integer multiple of the cache line size). Whenever a code requires an element from memory, the entire cache line is loaded from the main memory and retained in the cache. If a neighboring element from the same cache line is subsequently needed, the access will be a cache hit, and therefore very fast. This is a well-known mechanism, taught in most computer science classes.

But the implications are not always well understood. If a cache line is 64 bytes, then every memory transaction will involve the whole 64 bytes, no matter how few bytes are actually needed by the code. If only a single double-precision element (8 bytes) is needed from each cache line, 87.5% of the memory bandwidth is wasted on unused bytes. Therefore, it is very important to ensure that as few elements as possible in each loaded cache line go unused. Unfortunately, some extremely common programming techniques go against that principle. The most obvious offenders are structures (and of course objects, which are generally structures with associated functions).

While there is a lot to be said in favor of structures, that is not our subject; therefore we'll look at the downside: the way they can sometimes waste memory bandwidth.
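To make that 87.5% figure concrete, here is a small illustrative kernel (ours, not from the article): walking an array with a stride of one full cache line touches a single double per 64-byte line, so seven eighths of every line loaded is never used.

    #include <stddef.h>

    #define CACHE_LINE 64                         /* bytes per line   */
    #define STRIDE (CACHE_LINE / sizeof(double))  /* 8 doubles        */

    /* Touches one double out of every eight: each 64-byte line is
     * loaded for only 8 useful bytes, wasting 87.5% of the bandwidth. */
    double strided_sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i += STRIDE)
            s += a[i];
        return s;
    }

    /* Touches every double: all 64 bytes of each loaded line are used. */
    double contiguous_sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }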

The content of a well-defined structure is a lot of information related to an abstract concept in the code - it could be a particle used in a fluid simulation, the current state of a point in a discretized space, or even an entire car. All the occurrences of that concept will be allocated in what is described as an Array of Structures. Subsequently, functions in the code will go through all occurrences in a loop to process the data, such as this:

    foreach particle in charged_particles
        update_velocity(particle, electric_field)

Presumably, the function update_velocity will change the speed of the charged particle in the electric field. But while the speed might be defined with only a few values (e.g. the current velocity in X, Y and Z), the structure itself is likely to contain much more information that is irrelevant to that particular step of the computation. Yet as structures are contiguous in memory, much of that information will be loaded anyway, because it belongs to the same cache lines.

Let's assume each particle in our example is defined by 16 double-precision values, i.e. 128 bytes. The structure would occupy 2 full cache lines all by itself. Let's also assume all 3 velocities are in the same half of the structure, i.e. in the same cache line. Each time we need to update a velocity, the CPU will:
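In C, such an array-of-structures layout might look like the following sketch (the 16-field particle and the exact update are hypothetical stand-ins for the structure described above, not code from the article):

    /* Array of Structures: 16 doubles = 128 bytes = 2 cache lines per
     * particle, but the velocity update only touches vx, vy, vz
     * (24 bytes). */
    typedef struct {
        double x, y, z;        /* position                        */
        double vx, vy, vz;     /* velocity - the only fields used */
        double aux[10];        /* charge, mass, other state...    */
    } particle_t;

    void update_velocity(particle_t *p, long n, const double e[3]) {
        for (long i = 0; i < n; i++) {
            p[i].vx += e[0];   /* drags in a full cache line for 24 bytes */
            p[i].vy += e[1];
            p[i].vz += e[2];
        }
    }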

1 - Load the cache line (64 bytes) containing the three velocities (24 bytes);

2 - Perform any required computations, and update the values in the cache;

3 - Eventually, when space in the cache is needed, the whole cache line (64 bytes) will be flushed to memory.

62.5% of the memory bandwidth required to load and store the particle is wasted on unnecessary data.

If the three velocities were spanning both halves of the structure, then the bandwidth requirement would double for the same amount of useful data - wasting 81.25%.

One solution to this issue is called structures of arrays. As the name implies, the idea is to invert the relationship by first allocating all atoms of data in arrays, and then grouping them into a structure. Instead of having an array of particles, each with its velocities, the code would use three arrays of velocities. Each array would contain a single velocity component (X, Y or Z) for all particles. The result for the function update_velocity above is that we would need three cache lines instead of one: one for each of the X, Y and Z velocities. But that is not a bad thing. The first particle would require 192 bytes of bandwidth, loading 8 elements of each array. However, the next 7 particles would require no bandwidth at all - their data has been prefetched by the first particle, thanks to spatial locality. The average bandwidth is therefore an optimal 24 bytes per particle, with a greatly reduced latency because of locality. If that function was purely bandwidth-bound, we would have gained a factor of 2.66x by reorganizing the data in a cache-friendly manner.
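A structure-of-arrays version of the same hypothetical particle sketch (again, ours and illustrative rather than taken from the article) makes the contiguous velocity storage explicit:

    #define NP 1000000

    /* Structure of Arrays: each velocity component is contiguous, so
     * a 64-byte cache line holds 8 consecutive vx values - every
     * loaded byte is useful, and the hardware prefetcher sees three
     * linear streams. Allocate on the heap (e.g. with malloc). */
    typedef struct {
        double x[NP], y[NP], z[NP];
        double vx[NP], vy[NP], vz[NP];
        double aux[10][NP];
    } particles_t;

    void update_velocity(particles_t *p, long n, const double e[3]) {
        for (long i = 0; i < n; i++) {
            p->vx[i] += e[0];
            p->vy[i] += e[1];
            p->vz[i] += e[2];
        }
    }

This unit-stride layout is also what auto-vectorizing compilers prefer, since consecutive elements map directly onto SIMD registers.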

I.B - Multidimensional arrays vs. arrays of pointers

This is an issue that doesn't exist in Fortran, where multidimensional arrays are the norm. But in the C language family the issue is pervasive. A lot of code utilizes not multidimensional arrays but arrays of pointers, adding a useless intermediate load. Listing 1 illustrates this form of inefficient code. The data is stored behind a pointer to a pointer (hence the two stars).

Listing 1 - Example of array of pointers.

    void matrix_sum_aop(int n, double** A, double** B) {
        for (int i = 0 ; i < n ; i++) {
            for (int j = 0 ; j < n ; j++) {
                A[i][j] += B[i][j];
            }
        }
    }

The body of the code looks good to the untrained eye: it has a nice pair of square brackets to access the elements of the two-dimensional data, as is done in Fortran. But the performance will indeed be suboptimal. The real meaning of the code includes not one, but two chained loads to access the data. The machine must first evaluate A[i], itself a pointer to double. Then this pointer is used to retrieve A[i][j], i.e. the data itself. This is in effect an indirect access.

An efficient version can have exactly the same body, by utilizing a properly typed pointer to the data. This is illustrated in Listing 2. By specifying the dimensions of the array in the parameter list, it becomes possible to use the clean multidimensional notation of the C language - which unfortunately looks exactly the same as a pointer-to-pointer double dereference. This new version will simply compute the linear access i * n + j, and do a single lookup in memory to access the data.

Listing 2 - Example of multidimensional arrays.

    void matrix_sum_mda(int n, double A[n][n], double B[n][n]) {
        for (int i = 0 ; i < n ; i++) {
            for (int j = 0 ; j < n ; j++) {
                A[i][j] += B[i][j];
            }
        }
    }

Of course, the data allocation is also different: instead of first allocating the array of pointers, followed by a loop allocating all the rows of doubles, a single allocation of the entire data set is used: the exact same allocation that would be used for the ugly, explicitly linearized version illustrated in Listing 3, which most programmers justifiably try to avoid.

Listing 3 - Example of explicitly linearized arrays.

    void matrix_sum_linear(int n, double* A, double* B) {
        for (int i = 0 ; i < n ; i++) {
            for (int j = 0 ; j < n ; j++) {
                A[i * n + j] += B[i * n + j];
            }
        }
    }
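As an illustration of that allocation difference (our sketch, not shown in the original listings), the pointer-to-pointer version needs n + 1 allocations, while the multidimensional version needs exactly one:

    #include <stdlib.h>

    void allocation_example(int n) {
        /* Array of pointers: n + 1 allocations, rows scattered in
         * memory, and every A[i][j] costs two dependent loads. */
        double **A = malloc(n * sizeof(double *));
        for (int i = 0; i < n; i++)
            A[i] = malloc(n * sizeof(double));

        /* True 2D (C99) array: one allocation, rows contiguous.
         * A2[i][j] is a single load at linear offset i * n + j, and
         * A2 can be passed straight to matrix_sum_mda(n, A2, ...). */
        double (*A2)[n] = malloc(sizeof(double[n][n]));

        /* ... use the arrays ... */
        free(A2);
        for (int i = 0; i < n; i++) free(A[i]);
        free(A);
    }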

II - ALGORITHMS

The word algorithm can cover different types of problems depending upon the audience. Computer scientists will think in terms of basic programming techniques as explained by Aho et al. [4]. Other scientists in need of high performance computing will likely think more in terms of numerical analysis as introduced by Atkinson [5].

In either case, the idea is to achieve the desired results with minimal complexity, that is, with a minimal increase in work when augmenting the problem size. Complexity is usually taught in most computer science courses, and numerous recommendable books have been written on the subject, such as Arora and Barak [6].

It is usually quite difficult for a computer scientist to replace a numerical algorithm chosen to solve a particular problem by a "better" one. This requires understanding what the code is trying to achieve in computer terms, but in physical (or chemical, astronomical...) terms as well. It also entails understanding the numerical stability, approximation, the boundary conditions, and whatever other requirements might exist for solving the underlying problem. This is something that must be done with the involvement of someone from the scientific field. More often than not, the constraints of numerical algorithms are such that they unfortunately cannot be changed.

What remains to be studied is the implementation of such algorithms, which quite often are built on top of other algorithms - those of the computer science variety. Matrix inversion, matrix multiplication, finding a maximum value, fast Fourier transforms and so on are the building blocks of many numerical algorithms. They are quite amenable to improvements, substitution and well-studied optimizations.

One of the primary tasks of the optimizer of a numerical code will be to identify these types of conventional algorithms, and make sure the implementation used is the best one for the code. This is usually not a lot of work, as vendor-supplied libraries will include most of them. For instance, the Intel Math Kernel Library (accessible in recent Intel compilers by simply passing the -mkl option to the compiler and the linker) includes full implementations of BLAS and LAPACK [7], fast Fourier transforms including an FFTW3 [8] compatible interface, and so on. One important aspect when using such libraries is their domain of efficiency: extremely small problem sizes are not well suited to the overhead of a specialized library. For instance, large matrix multiplications should make use of an optimized function such as dgemm. Small constant-size matrix multiplications (such as the 3×3 matrices used in Euclidean space rotation) should be kept as small hand-written functions, potentially amenable to aggressive inlining.

In a similar way, whereas a large FFT can take advantage of FFTW3 itself or its MKL counterpart, a small FFT of a given size can sometimes be readily implemented with the FFTW codelet generator described in [9]. A more work-intensive optimization is the specialization of algorithms for constant input data. For instance, if the small 3×3 matrix mentioned earlier is a rotation around the main axis of a 3D space, the presence of zeroes and ones in the matrix can be hard-coded in the function, removing multiple useless operations.
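As a sketch of that last point (the function names are ours, not the article's): a rotation about the z axis has a fixed pattern of zeroes and ones, so the specialized version drops five multiplies and four adds compared to the generic 3×3 product.

    /* Generic 3x3 matrix-vector product: 9 multiplies, 6 adds. */
    void mat3_mul_vec(const double m[3][3], const double v[3], double r[3]) {
        for (int i = 0; i < 3; i++)
            r[i] = m[i][0] * v[0] + m[i][1] * v[1] + m[i][2] * v[2];
    }

    /* Specialized rotation about the z axis: the known zeroes and
     * ones of the matrix are hard-coded, leaving 4 multiplies and
     * 2 adds (c = cos(angle), s = sin(angle)). */
    void rotate_z(double c, double s, const double v[3], double r[3]) {
        r[0] = c * v[0] - s * v[1];
        r[1] = s * v[0] + c * v[1];
        r[2] = v[2];
    }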

III - PARALLEL CODES

Parallelization is one of the richest and most complex subjects of computer science, and is out of the scope of this paper. Many introductory books have been written on the subject, among which for instance Herlihy & Shavit [10], Mattson, Sanders & Massingill [11], Chapman, Jost & van der Pas [12] and Gropp, Lusk & Skjellum [13]. This section deals with running parallel code on contemporary machines without falling into common pitfalls.

Figure 1 - Latency (in nanoseconds) to walk sequentially through an array of varying size (in MiB), from the lat_mem_rd benchmark, on a dual X5680 system with 2 MiB pages. The four curves correspond to the four compute/memory placements: 0/0 and 1/1 (local memory), 0/1 and 1/0 (remote memory).

III.A - NUMA

All recent multi-socket x86-64 systems are NUMA (Non-Uniform Memory Access), i.e. the latency of a load instruction depends on the address accessed by the load. The common implementation in all AMD Opterons and all Intel Xeons since the 55xx series (i.e. the Nehalem-EP family) is to have one or more memory controller(s) in each socket, and to connect the sockets via a cache-coherent dedicated network. In AMD systems the links are HyperTransport, while for Xeons they are QPI (QuickPath Interconnect). Whenever a CPU must access a memory address located in a memory chip connected to another socket, the request and result have to go through the network, adding additional latency. Such an access is therefore slower than accessing a memory address located in a locally connected memory chip. This effect is illustrated in Figure 1, which plots the average latency of memory accesses when stepping through an array of varying size on a dual-socket Xeon X5680 (Westmere-EP). The first three horizontal plateaus are measurements where the array fits in one of the three levels of cache, while the much higher levels at over 12 MiB measure the time to access the main memory. The four lines represent the four possible relations between the computation and the memory: 0/0 and 1/1 represent access to locally attached memory, while 0/1 and 1/0 represent access to the other socket's memory. Going from about 65 ns to about 106 ns represents an approximately 63% increase in latency when accessing remote memory.

III.B - NUMA

To avoid these kinds of performance issues, it is important to understand how memory is allocated (virtually and physically) by the operating system, to ensure memory pages (the concept of paging is among those covered by Tanenbaum [14]) are physically located close to the CPU that will use them. The Linux page size will be 4 KiB or 2 MiB; the latter is more efficient for the TLB (again, [3]) but is not yet perfectly supported. This page size will be the granularity at which memory can be placed in the available physical memories.
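Placements like those in Figure 1 can be forced explicitly. A minimal sketch using libnuma (our illustration, not from the article; error checking with numa_available() is omitted, and the program must be linked with -lnuma):

    #include <numa.h>
    #include <stddef.h>

    /* Pin execution to one NUMA node and memory to another, then
     * walk the array touching one byte per 64-byte cache line: this
     * reproduces e.g. the "0/1" (remote) case of Figure 1. */
    char walk(int cpu_node, int mem_node, size_t bytes) {
        numa_run_on_node(cpu_node);
        char *a = numa_alloc_onnode(bytes, mem_node);
        volatile char sink = 0;
        for (size_t i = 0; i < bytes; i += 64)
            sink += a[i];                /* timing code omitted */
        numa_free(a, bytes);
        return sink;
    }

The same placement can be obtained without any code change by launching the binary under numactl, e.g. numactl --cpunodebind=0 --membind=1.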

In Linux, this placement is made by specifying in which "NUMA node" the physical page should reside. The default behavior is only partially satisfactory. At the time the code requests an allocation of memory, virtual space is reserved but no physical page is allocated. The first time an address inside the page is written, the page will be allocated on the same NUMA node that generated the write (if that memory is full, the system will fall back to allocating on other nodes). This is a very important aspect: if for one reason or another the memory is "touched" (written to) by only one thread, then all physical pages will live in the NUMA node where that thread was executing at the time.

Of course, if the advice from the first part was followed, then the thread has been properly "pinned" to its CPU and will not move. This means that any subsequent access by this same thread will benefit from the minimal latency. That's the good aspect of the behavior. One of the consequences is that in a two-socket system (two NUMA nodes), only half the memory will be used by a single-threaded process (unless it consumes a large amount of memory). It also means that only half the memory bandwidth of the entire system can be exploited, leaving the memory controller in the second socket unused. Another consequence is that if the code subsequently becomes multi-threaded (for instance through the use of OpenMP directives), then half the threads will exclusively use remote memory, and performance will suffer accordingly.

Depending on the kind of workload, there are some simple heuristics to limit the nefarious effects of NUMA:

• Single process, single thread: the default behavior will only use the local NUMA node, as explained. This might be a good thing (it optimizes the latency) or a bad thing (it cuts the available memory bandwidth in half). Some code might benefit from interleaving pages on all nodes, either by prefixing the command with numactl --interleave=all, or by replacing an allocation function such as malloc by a NUMA-specific function such as numa_alloc_interleaved.

• Multiple processes, each single-threaded: this is the norm for MPI-based code that does not include multithreading. The default behavior of Linux (provided the processes have been properly "pinned" to their CPUs) is excellent, as all processes will have low-latency memory, yet the multiplicity of processes will ensure the use of all memory controllers and all available bandwidth.

• Single process, multiple threads: this is the norm for OpenMP-based code. The default behavior is usually quite bad; more often than not, the code will cause the allocation of physical pages on the master thread's node, despite parallel sections running on all CPUs. Reading data from a file, receiving data from the network, and explicit non-parallel zeroing or initialization of the allocated memory will all lead to this single-node allocation. Forcing interleaving as above usually helps (by utilizing all available bandwidth), but is not optimal (it forces the averaging of latency rather than minimizing it); it still should be tried first, as it is very easy to implement. The ideal solution is to be able to place each page on the node whose threads will make the most use of it, but that is not a simple task (see the first-touch sketch after this list).

• Multiple processes, with multiple threads: although it might seem the most complicated case, it usually isn't: the typical placement of one process per socket, with as many threads as there are cores in the socket, leads to an optimal placement by the default behavior. No matter which thread allocates the memory, all the threads of the same process will see optimal latency.
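For the OpenMP case, the usual workaround for the master-thread first touch is to initialize the data in parallel, with the same loop distribution as the later compute loops, so that each thread's pages land on its own node. A minimal sketch (assuming pinned threads; the function name is ours):

    #include <stdlib.h>
    #include <omp.h>

    double *alloc_first_touch(long n) {
        double *a = malloc(n * sizeof(double));
        /* Physical pages are allocated at first write: by writing in
         * the same static distribution as the compute loops, each
         * page ends up on the NUMA node of the thread that will
         * actually use it. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            a[i] = 0.0;
        return a;
    }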

III.C - False sharing

False sharing is a long-standing problem in cache-coherent multicore machines (see Bolosky and Scott [15]). Modern CPUs generally use an invalidation-based coherency protocol (relevant details can be found in Culler et al. [16]), where every write to a memory address requires the corresponding cache line to be present in the local CPU cache with exclusive write access - that is, no other cache holds a valid copy.

"True sharing" occurs when a single data element is used simultaneously by more than one CPU. It requires careful synchronization to ensure that the semantics are respected, and incurs a performance penalty as each update requires the invalidation of the other CPU's copy. This back-and-forth can become very costly, but is unavoidable if required by the algorithm (unless, of course, one changes the algorithm).

"False sharing" occurs when two different data elements are used by two different CPUs but happen to reside in the same cache line. There is no need for synchronization. However, because the granularity of invalidation is the whole cache line, each data update by one CPU will invalidate the other CPU's copy. This requires costly back-and-forth of the cache line. This can greatly affect performance and is completely unnecessary: if the two data elements lived in different cache lines, no invalidation would occur for either, thus ensuring maximum performance.

For very small data structures (such as a per-thread counter), this can be a huge problem, as each update will lead to an unnecessary invalidation. For instance, Listing 4 describes a structure of two elements: neg and pos. Without the intermediate padding array pad, those two values would share a cache line, and when used by two different threads, would cause a major performance issue. The padding ensures that they are in different cache lines, thus avoiding the problem. A synthetic benchmark using this structure, running two threads on two different sockets of a dual X5680 system, would see an increase in run time of 50% when the padding was not in use. Hardware counters show a very large number of cache-line replacements in the modified state when padding is not used, versus a negligible number when it is (cache coherency and the modified state are well explained by Culler et al. [16]).

Listing 4 - False sharing.

    typedef struct {
        int neg;
    #ifdef NOCONFLICT
        int pad[15];
    #endif
        int pos;
    } counters;

For large arrays that are updated in parallel by multiple threads, the data distribution among the threads will be an important factor. If using a round-robin distribution (i.e. thread number n out of N total threads updates all elements of indices m such that n ≡ m (mod N)), then false sharing will occur on most updates - not good. But if a block distribution is used, where each thread accesses 1/N of the elements in a single contiguous block, then false sharing only occurs on cache lines at the boundaries of each block - at most one cache line will have shared data between two threads. Not only will the number of unnecessary invalidations be small relative to the total number of accesses, but they will usually be masked by the fact that one thread will update at the beginning of the computation and the other one at the end, thus minimizing the effect. And to ensure conflict-free accesses, the block size that each thread handles can be rounded to an integer multiple of the cache line size, at the cost of a slightly less efficient load balancing between threads.
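In OpenMP terms (our mapping, not the article's), the two distributions correspond to schedule(static, 1) and plain schedule(static):

    #include <omp.h>

    /* Round-robin: thread n updates indices m with m mod N == n,
     * so neighboring elements belong to different threads and
     * almost every cache line is falsely shared. */
    void scale_round_robin(double *a, long len) {
        #pragma omp parallel for schedule(static, 1)
        for (long i = 0; i < len; i++)
            a[i] *= 2.0;
    }

    /* Block distribution: each thread owns one contiguous chunk;
     * false sharing is limited to the cache line at each boundary. */
    void scale_block(double *a, long len) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < len; i++)
            a[i] *= 2.0;
    }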
CONCLUSION

This concludes this series of two articles, which introduced a few key points for the newcomer to keep in mind when trying to improve the performance of a code. To summarize, it is important to be able to justify the validity of the work done, in terms of reliability, speed and the choices made when improving the code. Other points are to properly exploit the tools at hand before modifying the code, to understand the relation between the data structures and the performance of the code, to avoid re-inventing the wheel by exploiting pre-existing high-performance implementations of algorithms, and finally to run parallel code in a manner adapted to the underlying hardware.

Hopefully, this will help programmers and scientists alike to understand some key aspects of performance and obtain higher-performing code without an excessive amount of work.

Happy programming! 

References

[1] W. A. Wulf and S. A. McKee, "Hitting the memory wall: implications of the obvious," SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, Mar. 1995.
[2] D. Truong, F. Bodin, and A. Seznec, "Improving cache behavior of dynamically allocated data structures," in Proc. 1998 International Conference on Parallel Architectures and Compilation Techniques, 1998, pp. 322–329.
[3] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1990.
[4] A. V. Aho and J. E. Hopcroft, The Design and Analysis of Computer Algorithms, 1st ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1974.
[5] K. E. Atkinson, An Introduction to Numerical Analysis. New York: Wiley, 1978.
[6] S. Arora and B. Barak, Computational Complexity: A Modern Approach, 1st ed. New York, NY, USA: Cambridge University Press, 2009.
[7] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen, "LAPACK: a portable linear algebra library for high-performance computers," in Proc. 1990 ACM/IEEE Conference on Supercomputing (Supercomputing '90). Los Alamitos, CA, USA: IEEE Computer Society Press, 1990, pp. 2–11.
[8] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005, special issue on "Program Generation, Optimization, and Platform Adaptation".
[9] M. Frigo, "A fast Fourier transform compiler," in Proc. 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation, vol. 34, no. 5. ACM, May 1999, pp. 169–180.
[10] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.
[11] T. Mattson, B. Sanders, and B. Massingill, Patterns for Parallel Programming, 1st ed. Addison-Wesley Professional, 2004.
[12] B. Chapman, G. Jost, and R. van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, 2007.
[13] W. Gropp, E. Lusk, and A. Skjellum, Using MPI (2nd ed.): Portable Parallel Programming with the Message-Passing Interface. Cambridge, MA, USA: MIT Press, 1999.
[14] A. S. Tanenbaum, Modern Operating Systems. Englewood Cliffs, N.J.: Prentice Hall, 1992.
[15] W. J. Bolosky and M. L. Scott, "False sharing and its effect on shared memory performance," in Proc. USENIX Experiences with Distributed and Multiprocessor Systems, Berkeley, CA, USA: USENIX Association, 1993.
[16] D. E. Culler, A. Gupta, and J. P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997.

/hpc_labs

Discovering OpenSPL, The Spatial Programming Language

Behind OpenSPL lies the factual concern that the temporal computing model of multicore processors is reaching its limits in terms of scalability, efficiency, and complexity. This calls for radical innovations in the design of systems destined for scale-up. What does spatial mean here? How does OpenSPL work, and what is the future of this new open programming language? Let's take a deep look...

OSKAR MENCER (1), MICHAEL J. FLYNN (2), JOHN P. SHEN (3)

OpenSPL is a novel initiative by CME Group, Chevron, Juniper and Maxeler Technologies to increase the awareness and acceptability of computing in space rather than in time sequence. While the initiative is new, the concepts and ideas behind computing in space have been used in practice for a very long time. In essence, making an Application Specific Integrated Circuit (ASIC) is just like computing in space, where writing the program takes many hundreds of engineers and several years of effort, even without considering the enormous costs. As such, a single ASIC is now so expensive that it has to be able to execute all possible programs and applications. Computing in space, on the other hand, brings the silicon substrate to the trial-and-error programmer and allows us to adapt the spatial structure to the problem the substrate is solving.

(1) CEO, Maxeler Technologies.
(2) Professor, Stanford University.
(3) Head of Nokia Research Center North America Lab.

Computing in space could be considered a generalization of decades of research (including the authors' contributions) towards systolic arrays, vector supercomputers, and a wide range of research projects such as Alan Huang's MIT thesis on Computational Origami [1]. Going further back, to the late 1950s, IBM revolutionized hardware design with its Standard Modular System (SMS), using standardized circuit cards that could be manufactured more quickly and reliably than with the older custom approach. A key piece of SMS was Automated Logic Design (ALD) sheets. The designer used a sheet with a 2D grid of blocks; each block specified a logic function (card) and a connection. Each connected block was entered on a punched card, enabling the design to be managed by computer. The "compiled" design checked the logic, updated signal data and produced the wire routing for manufacturing.

[1] The folding of circuits and systems, Applied Optics, 31(26), 1992.

This automated design and manufacturing process led to the market dominance of the IBM 1401 and 7090, among other machines. The SMS was a precursor to the SLT design system used in System/360 and the age of the mainframe. OpenSPL-based systems gain a similar advantage from employing the lessons of manufacturing (and early IBM systems), using assembly lines to build the results of computation.

Back to 2014: what does OpenSPL mean in practice? Rather than describing a thread of instructions and duplicating the thread onto multiple processors to handle multiple streams of data (SIMD), OpenSPL requires a split of the computation into control flow and dataflow components, creating a network of arithmetic units (see Figure 1) through which the data flows just like materials flow through a network of assembly lines in a modern factory. Each arithmetic unit forwards the result of its operation to the next arithmetic unit, eliminating the need for most register and memory accesses.

Fig. 1 - OpenSPL split of computation into control flow and dataflow.

How does a spatial program work? Easy: describe a directed graph with sources and sinks, and connect the sources and sinks to memory and other I/O channels. This can be done using the syntax of any programming language. Maybe we could have called OpenSPL "computing with directed graphs", but that seemed less elegant. Listing 1 illustrates this.

Listing 1 - An OpenSPL kernel using Java syntax.

    SCSVar x = io.input("x");
    SCSVar y = io.input("y");
    SCSVar t1 = x - y;
    SCSVar t2 = x + y;
    SCSVar r1 = t1 * t2;
    SCSVar r2 = t1 / t2;
    io.output("r1", r1);
    io.output("r2", r2);

Here SCS stands for Spatial Computing Substrate and hints at the purpose of the OpenSPL initiative. Rather than pretend that a single program could optimally compile to different architectures and substrates, we set OpenSPL as a baseline

from which members of the initiative can construct substrate-specific compilers and runtime implementations.

The most frequent question at this point is: how about conditionals? The answer is that "conditionals" or "if" statements are not all created equal. There are really three types of conditionals that any programmer can distinguish, but that are hard for machines or compilers to reverse-engineer as they look at the code: (1) conditional assignment, (2) data forks, and (3) separate global paths through a program.

Computing in space has three separate natural mechanisms to deal with the three types of conditionals: (1) is a simple multiplexer allowing the programmer to select between two data producers; (2) is a fork in one graph, or a fork driving data to two different graphs at the same time; (3) requires a bit of work in separating the different global paths into separate code and implementations. The code reorganization required by (3) is the most unpleasant part for the modern programmer, especially ones trained in the art of C++. Luckily, the untangling of global paths only has to be done for the parts of the program that take most of the time, which are typically small.

Listing 2 - Simple control flow in a dataflow kernel example. Both candidates (x+1 and x-1) are simultaneously computed in space and only the correct value flows out via the SCSVar result.

    class SimpleKernel extends Kernel {
        SimpleKernel() {
            SCSVar x = io.input("x", scsFix(24));
            SCSVar result = (x > 10) ? x + 1 : x - 1;
            io.output("y", result, scsFix(25));
        }
    }

Listing 3 - Moving average kernel example. Application-specific floating point numbers with a 7-bit exponent and a 17-bit mantissa are used.

    class MovingAvgSimpleKernel extends Kernel {
        MovingAvgSimpleKernel() {
            SCSVar x = io.input("x", scsFloat(7, 17));
            SCSVar prev = stream.offset(x, -1);
            SCSVar next = stream.offset(x, 1);
            SCSVar sum = prev + x + next;
            SCSVar result = sum / 3;
            io.output("y", result, scsFloat(7, 17));
        }
    }

The kernel graphs of the two examples. All coefficients (10, 1 and 1 on the left, and 3 on the right) can be set externally using the SCS IO interface. In the right graph, integer data values are used instead of the example's custom floating point (7, 17) format, for the sake of simplicity.

So now that we can program in 2D space, we need actual computers to do this on. The standard describes the concept of spatial computing substrates, or architectures that support the spatial computing paradigm.

While the standard is in its infancy and is still being refined, we already have a first commercially available substrate: Multiscale Dataflow Computing by Maxeler Technologies. With a range of server, networking and soon also storage products for high-end enterprise-level computing, such a substrate can be applied to a wide range of application domains and computing activities.

Multiscale Dataflow Computing already comes with simulators and debuggers, and a university program has brought access to the technology to over 100 universities worldwide. As the standard continues to develop, there could be many new substrates for computing in space. Such new substrates could also bring an order-of-magnitude reduction in the power consumption and physical size of devices to a wide range of additional domains such as wearable computing, embedded devices, mobile terminals and smart dust computing.

One promising domain is mobile supercomputing. Leveraging the OpenSPL standard and a new substrate for real-time sensing and processing, mobile supercomputing systems with an order-of-magnitude reduction in both power consumption and physical system size can become feasible. Such mobile supercomputing systems will become essential in the emerging Mobile Computing Universe.

The emerging Mobile Computing Universe consists of the cloud infrastructure, personal mobile devices, and embedded environmental sensors (the "Internet of Things"). To support the real-time processing of massive amounts of mobile data, and the deep analysis and inference needed to extract value from such big data, Exascale supercomputing infrastructure will be needed. However, due to the massive and global scale of this universe, implementing centralized cloud infrastructure to support such real-time supercomputing is not the most efficient approach. Distributing Petascale mobile supercomputers to the edge of the cloud is much more efficient and provides much better service.

The OpenSPL standard, based on the dataflow computing model, with an appropriate substrate supported by powerful software tools, offers the potential of achieving a two-orders-of-magnitude improvement in the performance/power ratio, and the opportunity to build mobile supercomputer systems that can be widely deployed without requiring special machine rooms and the associated supporting infrastructure. There is a real opportunity to bring power- and energy-efficient supercomputing to the mobile mass market. 

The MPC-X1000 providing eight dataflow engines.