Escola Tècnica Superior d’Enginyeria Electrònica i Informàtica La Salle

Final Thesis

Electronic Engineering

A Hybrid Beowulf Cluster

Student: Gonçal Roch Colom
Tutors: Dr Joan Verdaguer-Codina, Dr Jordi Margalef i Marrugat

FINAL THESIS EXAMINATION REPORT

The evaluating panel having met on this day, the student:

Gonçal Roch Colom

Presented their final thesis on the following subject:

A Beowulf Cluster with Intel, AMD and ARM Nodes for Teaching and Research

At the end of the presentation and upon answering the questions of the members of the panel, this thesis was awarded the following grade:

Barcelona,

MEMBER OF THE PANEL
MEMBER OF THE PANEL
PRESIDENT OF THE PANEL

A Hybrid Beowulf Cluster

Gonçal Roch-Colom

ETSEEI La Salle, Universitat Ramon Llull
Tutor: Dr Joan Verdaguer-Codina
Co-Tutor: Dr Jordi Margalef

Year of Presentation: 2013

Abstract

Every day, all over the world, companies, public and private institutions, and households alike discard thousands of old computers. Most are perfectly fine, and some are still quite powerful, but they are being replaced with brand-new, x86-based units, be it PCs or Macs. Corruption in existing Windows installations, minor hardware faults, omissions in manually updating the hardware, or generally being deemed 'too old' often lead to their demise. Recycling those forlorn but fully functional pieces of hardware into nodes of a powerful cluster for high-performance distributed computing seems not only a fascinating challenge but a worthy cause as well, especially in the teaching arena. Extra spice is thrown in in the shape of ARM SoCs, a building block to prepare our students for their future role in society.


Sumari

Cada dia, a tot el món, empreses, organismes públics i privats i particulars estan llençant milers d'ordinadors vells. La majoria estan perfectament bé, alguns són encara molt potents, però els estan canviant per altres de nous amb processadors x86, ja siguin PCs o Macs. Normalment la corrupció en els Windows instal·lats, petits problemes o manques d'actualització en el hardware, o considerar els ordinadors "ja massa vells" condueixen a la seva desaparició. Reciclar aquest hardware perfectament vàlid tot fent-lo formar part d'un cluster d'ordinadors per a la computació distribuïda d'alt rendiment ens sembla no només un repte fascinant, sinó també una bona causa, especialment aplicada a l'educació. Afegirem un toc addicional fent servir també SoCs amb ARM, bàsics per preparar els estudiants per al seu futur.


Acknowledgements

I wish to thank Joan Verdaguer for supplying ideas, for ceaseless manuscript follow-up and improvement, and for believing in me over the years; Jordi Margalef for his help and academic guidance; my wife for encouraging me to pursue this endeavour when time was the scarcest resource; my colleague Miquel Soler for helping out and brainstorming with me; my parents for believing in learning; Steven Vickers for his witty Jupiter Ace User's Guide, which shaped my mind in my mid teens; all involved in the inception of the ARM architecture, which has been a source of fascination to me; and, last but not least, the great Charles Dickens, whose work deeply influenced my life.


Table of Contents

1 Motivation and Objectives
1.1 Motivation
1.1.1 Reasoning behind the Motivation
1.1.2 Personal Motives
1.2 Objectives
1.2.1 Using this Project in the Academic world
1.2.1.1 Students
1.2.1.2 Teachers
1.2.1.3 Teacher Training
1.3 Open Source
2 A Changing Teaching Environment
2.1 A Revolution in Teaching: Using Computers
2.2 The Initiatives
2.2.1 Britain
2.3 The Everis Poll, STEM
2.4 Linux, not Windows
3 Beowulf and its Background
3.1 Definition of a Beowulf Cluster
3.2 Hybrid and Heterogeneous
4 Brief History of Architectures
4.1 x86 dominates the PC scene
4.2 The Relevance of ARM
4.2.1 The ARM in the Pi
4.3 x86 vs ARM
4.4 Tablets, smartphones, netbooks and aspiring desktops
4.5 1980s - Occam and Transputers
5 Brief History of OSs
5.1 MS-DOS and Windows


5.1.1 The Demise of the Home PC
5.1.2 Apple's dominance and the ...
5.1.3 The Future of the Personal Computer
5.2 Early 1990s
5.2.1 MINIX
5.2.2 Linux
5.3 1990s
5.4 2000s: Render Farms
5.5 Beowulf as a Valid Alternative
6 The Future
6.1 Mont-Blanc
6.2 UPC's Scientific, Technical and Educational Training
7 The ...
8 A Heterogeneous Cluster
8.1 Symmetric Load Balancing
8.1.1 Finest Grain: the Coimbra approach
8.1.2 Benchmarking nodes: the HINT benchmark
8.2 Asymmetric Load Balancing
8.3 Quibbles
8.4 Potential for Further Research
9 HBC Overview
9.1 Network Topology
9.1.1 Head / Master Node: RaspberryPi
9.1.1.1 RaspberryPi99
9.1.1.2 Rest of Nodes
9.2 Software
9.2.1 Raspbian OS on Master and Compute Nodes
9.2.2 Ubuntu 10.04.4 OS on Compute Nodes
9.2.3 Rest of software needed
10 Hardware - Building the System
10.1 Picking up the Bits


10.2 The Dell Poweredge 1500sc Server
10.3 The 3 HP Kayaks
10.4 The Webgine 1115XL laptop Hardware
10.5 The HP Pavilion AMD64 based laptop
10.6 The AMD based brandless desktop
10.7 The Raspberry Pi
10.8 Communications
10.9 Physical Layout: Positioning the PC's
11 Network - Building the System
11.1 The Downton Cluster
11.2 IP addresses and node names
12 Software - Building the System
12.1 Previous Considerations
12.2 General Procedure on PC's
12.2.1 Partitioning
12.2.2 Reverting to Old GRUB
12.2.3 Intel Microcode
12.3 Installing MPICH2
12.3.1 Creating specific user hbcuser
12.3.2 Installing NFS on master and compute nodes
12.3.3 Ensuring passwordless communication between nodes
12.4 Setting up for hybrid operation
12.5 Node-specific executables on each node
12.6 A few quibbles
13 Accessing the System
13.1 Locally
13.2 Remotely
13.3 Setting up each node (ARM, x86)
13.4 Worldly Considerations - On / Off
14 A Test Run


14.1 Choosing an Example
14.2 Running the Example
15 Feeding Mammoth Tasks to the System
15.1 Previous considerations
15.2 Unoptimised Linpack benchmarking tool
15.3 HPL (High-Performance Linpack) Benchmark
15.3.1 Installing, compiling HPL on Pi and on all x86 machines
15.3.2 Compiling on Cora
15.3.3 Know thy cluster
15.3.4 Atlas libraries and floating point capabilities
15.3.5 Setbacks
15.3.6 Compiling xhpl on AMD: 3dnow
15.3.7 Compiling xhpl on PentiumIII, SSE
15.3.8 Rough results, RaspberryPi as Master node
15.3.9 Using Mary as Master node
15.4 Pre-conclusions on system balancing
15.5 The NAS Benchmark
15.6 The Ping-Pong Test
15.7 Real World Tasks
15.8 MPICH and Python
16 Starting up
16.1 Firing Up the system
16.2 Starting HBC
17 Power, Efficiency and Environmental Considerations
17.1 Cooling
17.1.1 A Pi Quibble and a Bent Card
17.2 Room Temperature
17.3 ...
17.4 Mflops per Watt on HBC
17.5 Performance on HBC


18 Costs and Considerations
18.1 Man-hours
18.2 Electricity
19 Smartphones for Extended Eclecticism
19.1 Android phones
19.2 Linux on a Phone vs Pi
20 Conclusions
20.1 Low cost
20.2 Raw computing power
20.3 The need for Physical Nodes
20.4 Setting up a particular Cluster Architecture
20.5 Drawbacks
20.6 Tinker Around with Thy Cluster
20.7 Showing the World
20.8 The Future
A List of Hardware
B Benchmarks. Power consumption
B.1 Local Benchmarks. Individual consumption
B.2 HBC Benchmarks
B.3 Relevant HBC Benchmarks
C Code
C.1 Example icpi-info
C.2 The HPL suite
C.3 Scripts on Master Node
C.4 The HPL.dat file
D Dept of Education Master Plan Chart, 2013 - 14
E References


List of Figures

- The meaning of STEM
- Spanish educational systems confronted to technology trends and relating to industry reaction
- An Intel 8080 microarchitecture block diagram
- One of the two Intel Pentium III Coppermine CPUs in MrCarson
- An ARM block diagram (Source: arm1176jzf-s Technical Reference Manual, infocenter.arm.com)
- The Raspberry Pi's Broadcom ARM SoC with its RAM memory soldered directly on it
- Diagram of a Raspberry Pi E.18
- Our nicely clad Raspberry Pi
- A Raspberry Pi Software Block Diagram E.20
- The Dell Poweredge on the way to its destination
- Inside the Poweredge 1500sc. A well made piece of hardware, and a heavy one too
- Updating the Poweredge's BIOS
- A Pentium III Katmai processor. Left to right: assembly clip, rear plastic cover, clip, radiator and main processor board with processor and cache
- The AMD Athlon brandless desktop. Fan assembly at the top left, under the Conceptronic Ethernet switch
- Raspberry Pi's raspi-config helper
- The two Raspberry Pis on HBC and the residential gateway. An external HD for backup purposes
- The x86 machines on HBC
- An Intertek power meter measuring MrCarson's power consumption

Chapter 1

Motivation and Objectives

1 Motivation and Objectives

1.1 Motivation

Behind every single task accomplished by humans there is always a motivation driving it, be it philanthropy or enterprise. We also find motivations behind academic work. Some end-of-degree dissertations are produced under the GNU licence so that distribution and giving back to the community are ensured. Material belonging to the community as a collective good means the free availability of software for any kind of use, but it also means that both development and debugging - getting rid of errors - are more easily done by a wide community than by a small(er) bundle of individuals. That is exactly what Linus Torvalds meant when he created Linux under GNU licensing. Other end-of-degree dissertations are meant to investigate and/or solve a problem experienced by a private enterprise. Cutting across those two categories are end-of-degree dissertations by students who wish not only to finish their degree but also to meet a challenge - that of showing that they can use in the real world what they have been learning in their studies. That is the premise from which the definition of an end-of-degree project stemmed.

1.1.1 Reasoning behind the Motivation

The reasoning behind the motivation oftentimes blends with the motivation itself. 'Why do you do this?' is answered with the simple reply 'I have been told to', as well as 'I like it', and, further and possibly more importantly, 'I want to meet the challenge'. Far from clashing, these answers simply reflect purely rational and purely emotional viewpoints. The rational one stems from the academic need to finish a degree; the emotional one arises when the individual in question realises there is a challenge to be met which would not have existed had there not been the need to finish that degree. The author of this project posed himself many questions when it actually came to choosing a subject, some of which he had not asked himself before. All three reasonings applied, and a Beowulf cluster was chosen.

1.1.2 Personal Motives

In this rich, fast-paced, ever-changing world of ours, we humans increasingly have access to a great deal of material, in the shape of both information (which includes both misinformation and knowledge) and hardware (useful and useless things). Unfortunately this all comes at a cost, and students need to be taught to use freely available and scarce resources properly. On the one hand, misinformation does not help any community, nor does freely available and accessible, but ethically questionable, material. On the other hand, manufacturing hardware comes at an environmental cost: each and every piece of hardware leaves an environmental footprint. The amount of energy used to produce it can be calculated and weighed against its real-world functionality, thereby justifying its recycling and reuse. That is why getting the most out of any piece of hardware owned, used or accessed is one of our pet challenges. Reusing old desktop PCs on the one hand, and utilising extremely simple and inexpensive computers (Pis) on the other, therefore falls into place.

1.2 Objectives

There are many ways of building a Beowulf cluster, and a few possibilities when it comes to hardware as well. As the options were discussed with Dr Verdaguer, the wish to do something for the educational world prevailed: something that would motivate and amaze pupils. In fact, the idea was taken from the Stateside STEM education concept (Science, Technology, Engineering and Mathematics) 2.3, which also prevails in the UK and Europe - namely, to encourage and motivate pupils to take up studies in science and engineering. A decision was then made to build a Beowulf cluster not only using M2COTS 3.1, ubiquitous computers, but adding a little extra spice by making it different, and also by being able to play around with it from home. A new device had been made available only recently (and at that time sales were still rationed) as steps were being taken to define and register a title for this project: the Raspberry Pi board computer. The Raspberry Pi was seen as much more interesting than other available board computers (x86 based, or ARM based but more expensive) for a variety of reasons. It is powerful, inexpensive, and can be used as a standalone computer or as a microcontroller thanks to its dedicated I/O pins. Raspberry Pis (meant to sound like yummy pies) are perfect building blocks for students' electronic Meccano sets. Dr Verdaguer warned that implementing this project in real-world schools presented difficulties, since the school world is subject to constant change which may or may not be obvious to outsiders. This was important because private enterprise is trying to introduce many changes into the school system, some of which are not applied taking into account the environment they are introduced in. Implementation of a Beowulf cluster in a particular way is not the only goal of this dissertation; there are complementary goals we wish to attain. Quoting Ian Livingstone's words in his interview with Tech V3:

"Code is at the heart of everything we do in the digital world in which we exist. It's not just about video games and visual effects, it's also about designing the next jet propulsion engine, or fighting cybercrime, or running financial services.

"Coding is essential to everything, and with traditional manufacturing in decline and financial services in disarray, if the government wants the economy to succeed, you have to empower our creative nation with the skills necessary to serve digital content to global audiences via high-speed broadband, and code is absolutely essential to that." E.1

1.2.1 Using this Project in the Academic world

The chart in Appendix D lays out the Departament d'Ensenyament's guidelines, and this Project will serve those listed below.

1.2.1.1 Students

Development, follow-up and assessment of students' 'digital competence' as a 'cross competence'. See 1.1.1d on chart D.

Boosting resources and strategies for dynamising participation and communication spaces within the teaching community. See 1.3.3.e on chart D.

Digital inclusion: accessing teaching resources, digital working environments and digital media. See 1.4.4.c on chart D.

1.2.1.2 Teachers

Methodology, scientific, technical and teaching training and updating. See 2.8 on chart D: 2.8.8.e, Programme for Updating Teachers on Science, Technology and Teaching.

'Digital cross competence' of teachers in the aspects of teaching relating to technology, communications, methodology, axiology and professional improvement. See 2.8.8.g on chart D.

1.2.1.3 Teacher Training

See 3 on chart D: developing professional competence in teaching staff and teaching services. Training teaching staff coherently with training priorities, see 3.9. Training as a tool to ensure students' success in schools, see 3.9.9.a. Boosting teaching staff's digital teaching strategies as a 'cross competence', see 3.9.9.c.

1.3 Open Source

Throughout this project only open source software, and in some cases open source material, will be used. This applies to the software needed to run the system as well as to write this dissertation, edit and insert photos and charts, and so forth. This is in line with the global open source community philosophy. There are sound, down-to-earth reasons as well: open source software and hardware tools have been hugely improved and refined over the years and have thus reaped huge success, a paramount example being Linux 5.2.2 E.47, the operating system of choice for .

Chapter 2

A Changing Teaching Environment

2 A Changing Teaching Environment

In the world of primary and secondary school there are discussions and arguments caused by an ongoing transition to a system in which obtaining tangible results becomes paramount, within a global context in which economics plays a key role.

2.1 A Revolution in Teaching: Using Computers

Sweden is the European country with the most computers used in teaching. However, neighbouring Finland scored higher in PISA (Programme for International Student Assessment). Swedish education expert Inger Ekvist explained how the Swedes did not understand why that happened; in Finland it was pointed out to her that when the school system was reformed in their country in the 1970s (incidentally the same decade as Spain's), they went for technology as a base subject, helped by mathematics and language, for preschool and primary education. Some argue there are too many computers in Sweden when it comes to education and that they could be done without when it comes to teaching. It is also reasoned that computers are not the way to improve a country's economy. Debate is raging all over the world, and in the US they are looking at whether it would be possible to do away with teachers altogether and replace them with computers, which would help tailor teaching to each pupil's needs and capabilities. Isaac Asimov's words ring familiar; he envisaged using robot teachers to solve problems in the teaching field, especially in the teaching of English. South Korea is producing English-teaching robots for preschool and primary education, and 8,400 robots should be delivered through 2013. E.2

2.2 The Initiatives

In the US various methodologies are being tested and used, such as using paper or not using it at all. However, they all involve using traditional or tablet computers. Schools' investors and sponsors are the ones making the choices. We have been going through similar motions here in Catalonia: the 1x1 programme was cancelled after a change in the Government's political colour E.3 and programme 2.0 was brought along, which meant many schools that had chosen to go down the 1x1 road were left feeling out of place. Some teaching analysts complained there had not been time to fully evaluate the 1x1 programme as implemented by only some of the schools, as it was withdrawn too early. Some private institutions have also decided to go their own way, and Fundacio Bofill decided to implement a US programme called Magnet, based on direct cooperation between schools and cultural and scientific model institutions like MACBA (Barcelona's Museum of Contemporary Art). E.4 On a European scale there is much debate and diverse choices are being made; in Estonia they decided to teach programming to pupils aged 6 to 16. This comes as no surprise, since Estonians gave birth to Skype itself. They are using Scratch, a graphical programming environment which helps teach programming and eventually runs the simple programs devised by the pupils. This is also being used at some primary schools in Catalonia.

2.2.1 Britain

The United Kingdom has a rich and long tradition of innovation in general and computing in particular. The first machine that was in itself a computer was devised by an English engineer: Charles Babbage first designed his Difference Engine in 1835 (an advanced calculator) and his Analytical Engine in 1837, the first Turing-complete computer design in history. An efficient electronic computer, Manchester's Baby, was designed in Britain with a measly (back then) power consumption of 3.5 kW and was the world's first stored-program computer, and powerful, efficient computers were consistently designed in the UK before anywhere else. Many personal computers were designed in Britain: the Tatung Einstein, the Sinclair ZXs, the Oric line, the BBC Micro, and the Acorn line (whose inventiveness brought us the ARM CPU). And Sir Tim Berners-Lee invented HTTP, HTML and effectively the World Wide Web, as showcased in the 2012 London Olympics. Britain has seen one of the more heated debates as Education minister Michael Gove got some changes to the system through. Now the whole country understands that learning computing is not about using Office-like suites but rather about being able to write computer programs in a computer language as early as possible E.5:

Another misconception that is currently rife in the debate about a new curriculum is that the primary rationale for it is economic: we need more kids to understand this stuff because our "creative" industries need an inflow of recruits who can write code, which in turn implies our universities need a constant inflow of kids who are turned on by computers. That's true, of course, but it's not the main reason why we need to make radical changes in our educational system. The biggest justification for change is not economic but moral. It is that if we don't act now we will be short-changing our children. They live in a world that is shaped by physics, chemistry, biology and history, and so we rightly want them to understand these things.
But their world will be also shaped and configured by networked computing and if they don't have a deeper understanding of this stuff then they will effectively be intellectually crippled. They will grow up as passive consumers of closed devices and services, leading lives that are increasingly circumscribed by technologies created by elites working for huge corporations such

as Google, Facebook and the like. We will, in effect, be breeding generations of hamsters for the glittering wheels of cages built by Mark Zuckerberg and his kind. Is that what we want? Of course not. So let's get on with it. - Excerpt from Why all our kids should be taught how to code, John Naughton, The Guardian

Even if the British agree they need to improve their education system, the environment in the UK is dynamic enough for Google to decide to donate 15,000 Raspberry Pis to schools so that pupils work on programming and develop projects revolving around this device. Why this does not happen in Catalonia is rooted in many circumstances, one of which is the fact that the LOGSE law redefined Technology as a school subject and after only 5 years its hours in ESO were reduced from 280 to 140. So-called experts in Spain decided to talk about CTM rather than CTEM, which would have been the correct way to translate STEM.

2.3 The Everis Poll, STEM

Everis is a multinational company headquartered in Madrid specialising in consulting, IT and outsourcing professional services. It was founded in Madrid in 1996 as a subsidiary of DMR Consulting in Spain. In 2006, following the 2004 management buy-out of 100% of the company from , the company changed its name to its present one E.6. Everis, supported by e-motiva and by the Department of Education of the Generalitat of Catalonia, conducted a poll on the factors influencing the choice by secondary school students (ESO, K-7 to K-10) and baccalaureate students (college E.8, K-11, K-12) of science, technology and mathematics, according to the students' perceptions E.7. The study polled more than 4,700 students in the 3rd and 4th years of ESO (K-9, K-10) and baccalaureate, through questionnaires about factors influencing the choice of academic subjects in the CTM field. The presentation was named Influential Factors in Choosing CTM Studies. Views of students in 3rd and 4th ESO and Bachillerat. Everis used the acronym CTM referring to Science (C), Technology and Mathematics. We will compare CTM to STEM. STEM is an acronym used as an education term in the US and UK. It is a word which can be defined as E.9:

A slender supporting member of an individual part of a plant such as a flower or a leaf; also, by analogue, the shaft of a feather.

It is a noun and a verb; stems originate (ie stem) from a seed and reach up to the sky. The growth of a plant is a metaphor for the growth of knowledge thanks to proper education. STEM also has a deeper meaning regarding education policy E.9:

STEM fields or STEM education is an acronym for the fields of study in the categories of science, technology, engineering, and mathematics. The acronym has been used regarding access to United States work visas for immigrants who are skilled in these fields. It has also become commonplace in education discussions as a reference to the shortage of skilled workers and inadequate education in these areas. The initiative began to address the perceived lack of qualified candidates for high-tech jobs. It also addresses concern that the subjects are often taught in isolation, instead of as an integrated curriculum. Maintaining a citizenry that is well versed in the STEM fields is a key portion of the public education agenda of the United States.

Illustration 1: The meaning of STEM.

It is clear what importance STEM has in the US. Science, technology, engineering and mathematics are needed Stateside for level 4 and level 5 jobs according to the EU directives, and those are essential for the US's education and employment policies. Strikingly, Everis did not use the word Engineering in their presentation. The reason for this paradox lies in where the Industrial Revolution took place in Europe. Boixareu, President of UPC's Social Board, explains that the only part of the Iberian Peninsula to have been influenced by the Industrial Revolution was Catalonia. The mentality that does away with the E in STEM - not only the letter, but the concept - reckons that things can be achieved thanks to science and technology only. This is a fundamentally flawed view.

Illustration 2: Spanish educational systems confronted to technology trends and relating to industry reaction.

It leads to awkward situations, like a company outsourcing the manufacture of 11,000 bicycles to a company outside Catalonia (in Copenhagen) on the grounds that banks would not give them credit. And, more worryingly, to students not choosing Engineering because they have not heard it mentioned. Dr Verdaguer suggested an experiment to be performed by a school teacher on his 4th grade ESO students: he asks them what the ENG key on a calculator is there for. More often than not, not one of them knows that the key is there to express numbers in powers of ten and show them in scientific E notation. The Everis poll suggests that encouragement is needed when pupils find themselves at the crossroads between science and arts, between 3rd grade of Primary School and 1st grade of Secondary School, so that they can decide to choose STEM if they so wish. In a project at the US National Science Foundation they are looking at 3D printing as a new way of looking at STEM, an "on ramp for the future" E.9:

The lab school will also serve as an "on ramp" for the Commonwealth's larger advanced manufacturing initiative. A study done by The Boston Consulting Group suggests that over the next 20 years, advanced manufacturing could create 15,000 to 20,000 new jobs in the Commonwealth. Filling these positions, however, will require redesigning the K through 12 curriculum. "We have to realign our courses with modern technology," says Haj-Hariri. "We can't use a 1950s curriculum and we can't make this a 21st century version of shop. We want to get the E in STEM but our goal remains teaching science concepts. Engineering creates the context for us to introduce students to those concepts in a way they actually will understand and retain." "Changes in technology change what's possible. This project has moved incredibly fast, but it will level the playing field and enable all students to learn about these concepts." Accessibility is one of the fundamental aspects of the project.

2.4 Linux, not Windows

Dr Verdaguer remarks that ESO and Bachillerat pupils are purely Windows users. Within the 1x1 programme a Linux distribution localised into Catalan was introduced in the shape of Linkat. Linkat was completely left aside by pupils: they ended up not being able to perform simple tasks like creating folders and moving files, which they already do in an automated way in Windows. Raspberry Pi is an Open Source project, and as such not only can the board's design and schematics be obtained for free, but it runs open-source Linux software, more secure and powerful than Windows, as its philosophy is based on MIT's .

Most importantly, there are countless resources on the Net on how to set up your Pi, starting with powering it up using a mobile phone charger, going on to very basic user-oriented tasks (it is an excellent device for browsing the Net and editing Office documents), and on to advanced tasks of any sort (compiling kernel modules, turning it into a powerful networked computer and/or microcontroller). The Raspberry Pi, as a device and a concept, is therefore seen as a key ingredient in teaching students in Catalonia so that they evolve into individuals injecting competitiveness into local businesses. Dr Eben Upton is Raspberry Pi's co-founder, and his argument helps make this point E.11:

Dr Upton and colleagues at the Cambridge Computer Laboratory had been concerned at the declining figures for young people interested in computer science. Peaking in the mid-1990s at 500 people applying for 80 places each year to read computer science, by five years ago the number of applicants applying for the same number of places had halved. That decline, says Dr Upton, was parallel to a decline in the skills and experience of the undergraduates taking the places. Many had tinkered with websites but few had touched programming. This created a real problem because the University needs a good supply of potential candidates and does not want to spend the first year bringing people up to speed. "Three years after it is a problem for the University, it is a problem for industry, because a decline in the number or skill set of applicants turns into a decline in the number or skill set of graduates." Initially it was hoped that Raspberry Pi, by being aimed at children, would increase the numbers looking to study computer science at university. Two additional markets have now emerged, one of which is the hacking or 'maker' community, who already have some knowledge of the technology and want to build projects. The second market to emerge, says Dr Upton, was that of the developing world.

Chapter 3

Beowulf and its Background

3 Beowulf and its Background

A Beowulf has always been in the making. The Beowulf concept itself was officially born in 1998, when it was included in the Linux Documentation Project. This is how Jacek Radajewski and Douglas Eadline described Beowulf in the Linux Documentation Project (the document is now hosted on an MIT server):

Beowulf is not a special software package, new network topology, or the latest kernel hack. Beowulf is a technology of clustering computers to form a parallel, virtual supercomputer. Although there are many software packages such as kernel modifications, PVM and MPI libraries, and configuration tools which make the Beowulf architecture faster, easier to configure, and much more usable, one can build a Beowulf class machine using a standard Linux distribution without any additional software. If you have two networked computers which share at least the /home file system via NFS, and trust each other to execute remote shells (rsh), then it could be argued that you have a simple, two node Beowulf machine.E.26
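The quoted recipe is simple enough to sketch: a control machine dispatching a command to every node over a remote shell (ssh nowadays rather than rsh) and gathering the results. The hostnames below are hypothetical placeholders, not the machines of this project.

```python
import subprocess

# Hypothetical node hostnames -- replace with the machines on your own network.
NODES = ["node1", "node2"]

def ssh_command(host, command):
    """Build the argument list that runs `command` on `host` over ssh."""
    # BatchMode avoids interactive password prompts; the nodes are assumed
    # to trust each other (key-based login), as in the Beowulf definition.
    return ["ssh", "-o", "BatchMode=yes", host, command]

def run_on_node(host, command):
    """Execute a command on a remote node and return its standard output."""
    result = subprocess.run(ssh_command(host, command),
                            capture_output=True, text=True)
    return result.stdout.strip()

# Example (requires reachable nodes with passwordless ssh set up):
#   for host in NODES:
#       print(host, "->", run_on_node(host, "uname -sr"))
```

With /home shared over NFS and key-based ssh between the machines, this handful of lines already amounts to the "simple, two node Beowulf machine" of the definition above.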

3.1 Definition of a Beowulf Cluster

A Beowulf cluster is a computer system made up of a number of computers connected to a network. The cluster is given a complicated task to perform, and each of the computers in the cluster (called a node) performs a subtask that helps complete the overall task. According to Robert G. Brown (Duke Physics, Durham, US), a true Beowulf is a cluster of networked computers complying with the following requirements and characteristicsE.12:

• The nodes are dedicated to the Beowulf cluster.

• The network on which the nodes reside is dedicated to the Beowulf cluster.

• The nodes are Mass Market Commercial-Off-The-Shelf (M2COTS) computers.

• The network is also a COTS entity.

• The nodes all run open source software.

• The resulting cluster is used for High Performance Computing (HPC).

3.2 Hybrid and Heterogeneous

Different hardware architectures will be used in this cluster:

• x86 based hardware (power hungry and sometimes higher performing owing to higher clock frequencies; widely available in PCs and Macs),

• ARM based hardware (extremely power efficient, more stable, and widely available in SoC devices such as mobile phones, routers, or TVs).

Some of the reasons for doing this are:

• For many applications, ARMs will do nicely, and will happily replace a larger, power-hungry PentiumIII-500MHz based PC with a huge 512 MB of RAM.

• This Beowulf cluster involves recycling those old computers that are available, so the hardware will be homogeneous only if we are lucky. In our case three HP XM600 Kayaks were obtained with similarly clocked PentiumIII CPUs. Two Raspberry Pis belong to the project as well.

• A Raspberry Pi is the ideal control node. It can be kept powered on at all times without power or safety concerns. It can also be used to gain access to the whole system from the outside world.

• One of the purposes of this project is evaluating how a Pi will perform in a Beowulf environment when teamed up with particular nodes on the cluster.

Chapter 4

Brief History of Architectures

4 Brief History of Architectures

4.1 x86 dominates the PC scene

Intel brought out their 8008 back in 1972 and, more relevantly, their 8080 (evolved from the 8008) in 1974. Those CPUs were initially mostly aimed at calculators (Seiko wished to use an 8008 in a calculator). Still, they were enhanced and improved, and ended up as computer microprocessors, used in computing applications such as IBM's PC and many other PCs of that era, like HP's 150. The microarchitecture continued to be developed and improved, but it was all based on the very first specification, and in order to keep backward compatibility all subsequent designs remained 8080 based. Eventually a RISC version came out in the 1990s – the PentiumE.45 – which was also improved upon, with all the iterations ever since: MMX, II, III, the awfully inefficient IV, HT (single core multi-threading) and D (dual core), then the more efficient M (Mobile, based on the PIII) and Centrino (based on the M). Nowadays Intel's i3/i5/i7 CPU architecture chipsets (codenamed Ivy Bridge) are ubiquitous.

Illustration 3: An Intel 8080 microarchitecture block diagram.

But the whole x86 family is based on a much too simple design premise: we were building microcontrollers in the distant past, and a microcontroller CPU running complex operating systems can spell trouble. This is just what we have: it is designed to wake up and, first thing, jump to an address specified by the user (or by the program initially run by the user) and execute the code starting at that address, which makes it more vulnerable to viruses and backdoors. Also, this used to be a very simple microcontroller CPU: instructions are not labelled 'data' vs 'code', and they are not assigned privileges depending on whether they belong to an application, a driver, or the operating system.

All this has led us to the current computer security nightmare. Microsoft operating systems have been written leaving a series of backdoors open (it has been speculated there have been reasons to do so, i.e. letting MS people or their accolades gain access to systems), and the x86 architecture helps. (Interestingly, WindowsCE, which was written for ARM CPUs, is virtually invulnerable; companies writing antivirus software for WinCE tried to sell their services but were unable to provide updates to 'virus databases', since there were none.)

Illustration 4: One of the two Intel Pentium III Coppermine CPUs in MrCarson.

Simple instruction set modifications have been brought in (like Data Execution Prevention, or DEP, implemented in hardware), but the chips are still based on the original 8080, and they are plagued with vulnerability issues. That is why Intel bought out McAfee. Sadly, all Linux versions are known to have a number of vulnerabilities made possible by the x86 architecture itself. Thankfully, the design of the operating system itself is rock solid, which is a huge help in securing a system.

4.2 The Relevance of ARM

In 1981, Acorn Computers brought out the BBC Micro, a home computer designed for educational and research purposes. It came in a box with its own keyboard and could be hooked up to a home TV. The BBC Micro was 6502 based but was revolutionary in its own way: it came with a ROM based assembler that could be used alongside its BASIC programming language, something the writer of this dissertation always wanted to get his hands on. Alas, a Grundy NewBrain had already been bought: a Z80 based computer that originated at Sinclair Radionics and had been a contender for the BBC contract, which Acorn Computers eventually won. (The NewBrain is now back to life thanks to some solder, a couple of tantalum capacitors, two resistors, a Schmitt trigger IC, and some LEDs to check whether the power-up and CPU reset signals behaved in a healthy way.)

When Acorn made the decision to design their own CPU, they made sure the specification would survive for years to come – bus widths and cleverly designed instructions, with bits in the code telling each instruction what layer it belongs to (driver / operating system / system application / user application), whether it is an instruction or a chunk of data, and so onE.44.

Illustration 5: An ARM's block diagram. (Source: ARM1176JZF-S Technical Reference Manual, infocenter.arm.com)

How advanced the specification was, and how power efficient the architecture was in itself, are two factors that explain why ARM chips are to be found in every single smartphone around the world. (Acorn Computers went through a variety of financial difficulties in the second half of the 1980s, but a spinoff, Advanced RISC Machines (ARM), was created in 1990 and continued to develop the ARM architecture.)

4.2.1 The ARM in the Pi

A particular iteration of the ARM CPU is the one used on this project, namely the ARM1176JZF-S. This is the CPU core inside the SoC (System on Chip) used on Raspberry Pis, which comprises that CPU and a Broadcom VideoCore 4 GPU. Unfortunately, direct, straightforward access to the Broadcom GPU as a number-crunching unit cannot be gained; otherwise it would be possible to use the chip to its full processing power, and at 24 GFlops per chip amazing performance could be obtained:

– 270 MFlops is roughly what the CPU is capable of;

– 24 GFlops is what the Broadcom VideoCore 4 GPU is capable of.

Regarding CPU features, /proc/cpuinfo says it is Java capable and that it can run the 'thumb' instruction set, which was originally designed with PDAs in mind so that code took up less space, albeit at the expense of performance. This is ARMv6, and the ARM Infocenter tells us:

ARMv4T and later define a 16-bit instruction set called the Thumb instruction set. Most of the functionality of the 32-bit ARM instruction set is available, but some operations require more instructions. The Thumb instruction set provides better code density, at the expense of performance. ARMv6T2 introduces a major enhancement of the Thumb instruction set by providing 32-bit Thumb instructions. The 32-bit and 16-bit Thumb instructions together provide almost exactly the same functionality as the ARM instruction set. The enhanced Thumb instruction set (Thumb®-2) achieves the high performance of ARM code together with the better code density of 16-bit Thumb code. E.15
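The feature flags mentioned above can be read programmatically from /proc/cpuinfo. The snippet below is a small illustrative parser; the SAMPLE text is modelled on a Pi's typical output rather than captured from the machines in this project.

```python
def parse_features(cpuinfo_text):
    """Return the list of feature flags from ARM /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("features"):
            # The line looks like "Features : swp half thumb ..."
            return line.split(":", 1)[1].split()
    return []

# Modelled on a Raspberry Pi's /proc/cpuinfo; on a real Pi one would run:
#   flags = parse_features(open("/proc/cpuinfo").read())
SAMPLE = """\
Processor\t: ARMv6-compatible processor rev 7 (v6l)
Features\t: swp half thumb fastmult vfp edsp java tls
Hardware\t: BCM2708
"""

flags = parse_features(SAMPLE)
print("thumb" in flags, "java" in flags)  # prints: True True
```

The 'thumb' and 'java' entries are exactly the two capabilities the text discusses.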

4.3 x86 vs ARM

Much has been written and said about the ubiquitous x86 architecture. Things like newer versions of Pentiums disappointing critics and sometimes users, or the original PentiumIV's NetBurst architecture, which meant an improvement in some circumstances but a step back in performance relative to the PentiumIII while generating huge amounts of heat, have already been much discussed. For that reason they will not be delved into too deeply here.

Illustration 6: The Raspberry Pi's Broadcom ARM SoC with its RAM memory soldered directly on it.

x86 based CPUs are power hungry and generate too much heat, which is why they are not to be found in a single smartphone nowadays. RIM (the BlackBerry makers) did use an 80386 chip in one of their early devices (2001) but quickly switched over to more efficient chips. For that reason, mobile phone manufacturers have been using ARM CPUs since the very beginning: Nokia's 3650 used an ARM CPU clocked at 104MHz back in 2002, and HP's iPAQ 6340 (which ran WindowsCE v4) used an ARM at 168MHz. And here comes the big question: why not make a desktop computer with an ARM CPU in it?

PCs would be greener – and much more secure. The ARM architecture is so secure in itself that it has proved nearly impossible to write viruses and malware even for Windows CE running on ARM. The answer to that question is Intel's market dominance. x86 chips were to be found everywhere, and software had been written for them for decades. Admittedly, the PCI bus architectures which had been evolving around x86 hardware were more efficient than what was (and is) available for ARM CPUs. (Possibly Acorn's Archimedes designers have something to say about that – again, Intel's huge market inertia killed mostly everything else off.) But bus architectures for mobile devices (ultimately SoCs) continued to evolve as well, and SoCs became available which were as efficient as their x86 counterparts, MHz per MHz, at a fraction of the power.

4.4 Tablets, smartphones, netbooks and aspiring desktops

A number of ARM based netbooks with minimal power consumption and high specs started to appear, but for some reason they have been neither publicised nor developed further. One suspects the Intel faction is behind that. Intel has been fighting back with its Atom processors, no more than an implementation of an x86 made as power efficient as possible, with special 3D wiring inside the chip. That made them very power efficient x86 CPUs, but nothing more: they still heat up considerably, and the more powerful units need a fan, the less powerful ones a good wide heatsink. Unlike ARMs.

Apple realised how important ARM was to their own business as soon as it became obvious only an ARM could power their smartphones and tablets, and they bought a 10% stake in ARM. However, as pointed out by an ARM executive, the company's wish is to remain independent so as to be as OS independent as possible. ARMs can run Windows 8, iOS, QNX, Windows CE, and many more, including of course Linux.

In 2012, the Raspberry Pi Foundation brought out a minimalistic version of a real computer, for educational and research purposes, inspired by the BBC Micro. The Raspberry Pi is a SoC computer based on an ARM core and 256MB of RAM, costing only around €35 and coming with a composite video output, an HDMI output, two USB ports, one SDHC card slot, and a microUSB power port for use with a generic mobile phone charger. A new version then came out, this one made in the UK and with 512MB of RAM soldered directly onto the ARM CPU. Thus the Pi looks set to be a worthy member of a Beowulf cluster. It will also use very little power, so our electricity bill will be a little more reasonable and it will not be necessary to rush out of the Beowulf cluster room when the cluster is active, at least not within the first half hour.

4.5 1980s – Occam and Transputers

In the early 1980s talk emerged of a supercomputer made up of many efficient but simpler processing units rather than a single hyperfast unit. Such a machine was designed in the UK by the company INMOS, and a programming language called Occam was designed for the INMOS hardware. It was all about efficiency: cleverly distributed concurrent computing versus localised high performance hardwareE.46.

David May was the mastermind behind the INMOS Transputer project: he designed both the hardware and the programming language, Occam. (The language was named after Occam's razor, the principle of parsimony attributed to William of Ockham, of Ockham, England: when a problem needs to be solved and there are a number of hypotheses, the one making the fewest assumptions should be chosen. The principle was only given its name in the Victorian era.) Interestingly, Occam and the Transputer already had something in common with ARM and its compiler, as with most high efficiency RISC CPUs and their compilers: hardware and software (compiler / language) should be designed by the same people in order to get the most out of a system. Wikipedia tells us a bit more about a very early version of Occam and the later, more useful version from the late 1980s:

occam 1 (released 1983) was a preliminary version of the language which borrowed from David May's work on EPL and Tony Hoare's CSP. This supported only the VAR data type, which was an integral type corresponding to the native word length of the target architecture, and arrays of only one dimension. occam 2 is an extension produced by INMOS Ltd in 1987 that adds floating-point support, functions, multi-dimensional arrays and more data types such as varying sizes of integers (INT16, INT32) and bytes.
With this revision, occam became a language capable of expressing useful programs, whereas occam 1 was more suited to examining algorithms and exploring the new language (however, the occam 1 compiler was written in occam 1, so there is an existence proof that reasonably sized, useful programs could be written in occam 1, despite its limitations). E.16

Chapter 5

Brief History of Operating Systems

5 Brief History of OSs

5.1 MS-DOS and Windows

How MS-DOS and Windows were born shall not be reviewed here; instead an excellent film is recommended, Pirates of Silicon Valley, which also depicts the birth (and initial decline) of Apple. Suffice it to say that MS-DOS was an OS copying most of the concept of Digital Research's CP/M (Control Program for Microcomputers), which did not contemplate more than 64k of memory or more than one task and had been written for Intel's 8080, Zilog's Z80 and other microprocessors. When the 80386 iteration of the x86 was brought out, the chip was capable of real-time multi-tasking thanks to its Virtual Mode (as opposed to the static multi-tasking embedded by MS in MS-DOS 5.0 and supported by the 80186 and 80286). MS eventually took note of Apple's marketing foresight when it came to promoting and selling graphics based computers, and in 1985 released Windows 1.x on MS-DOS, adding a set of libraries for a graphics environment. Windows 1.x, however, did not have multi-tasking functionality. In the meantime, Quarterdeck brought out DESQview in 1985, which offered real-time multitasking in a text mode environment. Newer versions of Windows (notably 3.x in 1990) fully utilised virtual memory, virtual device drivers allowing devices to be shared amongst running applications, and real-time multitasking.

All along, however, Windows was based on (and actually ran on, until Windows 2000 emerged) MS-DOS, which has been plagued not only with bugs but also with vulnerability issues. This is because, in order to preserve backward compatibility, new code comes in the shape of patches rather than anything else. Additionally, the x86 architecture is intrinsically vulnerable due to its complete lack of application / driver layer control. Thus it is that Windows OSs running on x86 are plagued with backdoors and vulnerabilities of all kinds. A deeper level of discussion would be needed to delve into the reasons why Windows is so much more vulnerable than other OSs. One of the reasons is x86 (Windows CE running on ARM is nearly invulnerable), but it could also be argued that backdoors are left open on purpose, in the same way that a virus can install itself automatically without user intervention, because it is in Microsoft's own interest (and that of some of its partners) to allow the system to install programs and daemons performing certain operations that are beneficial to Microsoft only.

5.1.1 The Demise of the Home PC

It should be noted that the real cause of the demise of the Home PC may be that it no longer makes sense to develop home PCs beyond adding graphical features and visual enhancements (as Apple has been doing for some years now). One wonders at the fact that Windows based PCs are so complicated that they become unpleasant to use. The antivirus software constantly running (gobbling up CPU cycles, spinning up CPU fans, and helping hard discs die an early death) is a concern in itself. It could also be argued that smartphones and tablets are selling so well because they are growing in oomph all the time: newer screens and extremely efficient multi-core CPUs mean there are many things they can do better than a home PC, even though they are still designed with gestures in mind rather than desktop computing. The finance magazine Forbes argues that it is Microsoft's and Intel's fault: after all, sales are often market-driven, and Microsoft and Intel do not seem to have found a way to motivate more people to buy PCs, whereas the likes of Apple or Samsung have.E.22

5.1.2 Apple's dominance and the personal computer

Apple are not interested in manufacturing personal computers any more. It is partly Mr Jobs's fault, because he decided it was not interesting to develop the PowerPC architecture any further and that Apple would rather use standard x86 hardware from then onwards. The PowerPC architecture had been the big difference between standard PC hardware and Apple personal computers. As it is, nowadays an Apple personal computer is in itself a PC, and its hardware is exactly the same as any PC's. Apple are still branding theirs (excellent design coupled with programmed obsolescence). For the moment Apple concentrate on selling iPads (still very successfully) and iPhones. Ever since OSX was introduced, Apple's home computers have been running an operating system which evolved from NeXTSTEP, itself based on a Mach microkernel with BSD Unix components and a proprietary window rendering system; since this has been ported to x86, in practice there is no big difference between OSX and Linux. This also means that OSX should be as secure as Linux. OSX can also be bought separately and installed on some x86 PCs, a better road to follow than using Windows. Apple computers are still seen as not quite the industry standard (indeed Linux sometimes is, rather more so) and in any case not as attractive and easy to use as tablets and smartphones. Those two kinds of devices offer new functionality, but there is still a lot that can only be done on a personal computer.

5.1.3 The Future of the Personal Computer

Many people think that the future of personal computing could well lie in ARM based devices running Linux, and we are of the same opinion. Those would be endowed with power to spare, portability, and security. A would-be commercial example is Ubuntu's all-in-one: a smartphone that runs Android on the move and becomes an ARM based Ubuntu desktop computer when docked.E.23 Brilliant as the idea is, manufacturers of desktop-only / tablet-only / smartphone-only devices (big players like Apple, Samsung, Google, MS, and so on) are bound to oppose it, and it might end up in oblivion sooner than we would all want. Smaller scale projects would be along the lines of the Raspberry Pi (inexpensive and flexible) or the BeagleBoard hardware, versions of which (e.g. the BeagleBone) contain high-performance ARM coresE.24.

5.2 Early 1990s

Up to the early 1990s there were several mainstream types of hardware and software:

• Unix OS on mainframes

• BSD (a branch of Unix) on workstations (Sun Microsystems et al.)

• MS-DOS and MS Windows on x86 workstations and home PCs

• Mac OS on 68xxx workstations and home PCs

• The home and hobby oddities: Atari and, most importantly, the ARM based Acorn Archimedes running Acorn's RiscOS

• MINIX (see below)

• QNX, a real-time Unix-like OS from the early 1980s for high-risk applications in the industry and medical sectors

5.2.1 MINIX

Last but not least, MINIX was a Unix-like OS which was ported to virtually all hardware platforms. It was initially written for the IBM PC systems available at the time, but it later included support for MicroChannel IBM PS/2 systems and was also ported to the Motorola 68000 and SPARC architectures, so computer platforms like the Atari ST, Commodore Amiga, Apple Macintosh and Sun SPARCstation could run it. There were also unofficial ports to ARM and INMOS Transputer processors. MINIX did look a lot like what Linux would eventually be; still, it was not free, and it had been developed with university courses and research in mind.

5.2.2 Linux

Then along came Linus Torvalds. He was already using MINIX, but it was not free software. Why not develop his own Unix-like OS for the x86 architecture and the (at the time) relatively powerful 80386 CPUs which most home computer users looked up to as cutting edge PC technology? Torvalds needed to access the large UNIX servers at Helsinki University and had a new 386 PC. He wrote what was initially a terminal emulator, specifically for the hardware architecture he was using, and it had to have OS capabilities already since it ran on its own. He developed it using the GNU C Compiler, still a widely used choice for compiling Linux nowadays. Torvalds used the GNU C Compiler on MINIX to write Linux, just as UNIX had been fully (re-)written in C in 1973 except for device driver code. Linux was not a copy of MINIX, which thankfully has been publicly acknowledged by Tanenbaum himself, the author of MINIX: the latter has a microkernel whilst the former has a monolithic kernel, with corresponding advantages and disadvantages, especially when a new driver needs to be compiled into the kernel. Still, a Unix-like system is a Unix-like system, and the important part MINIX played in the rise of Linux has to be acknowledged. This is Linus Torvalds's famous 25 August 1991 post on the Usenet newsgroup comp.os.minix:

Hello everybody out there using minix -

I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready. I'd like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things).

I've currently ported bash(1.08) and gcc(1.40), and things seem to work. This implies that I'll get something practical within a few months, and I'd like to know what features most people would want. Any suggestions are welcome, but I won't promise I'll implement them :-)

Linus ([email protected])

PS. Yes – it's free of any minix code, and it has a multi-threaded fs. It is NOT portable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that's all I have :-(.

—Linus Torvalds E.28

5.3 1990s

And so the hardware and software market continued to evolve. Silicon Graphics (SGI) continued to develop their sophisticated, expensive MIPS based CGI workstations for 3D rendering, whilst PCs were relegated to domestic and office work. George Lucas was a visionary who revolutionised the film industry: he founded not only the film production company Lucasfilm but also, in 1975, its division Industrial Light & Magic (ILM), an American motion picture visual effects company. This was when Lucas began work on production of the film Star WarsE.29. The subject of a grid of computers rendering graphics in parallel really turned up here, but more years had to go by before it would become a reality. Images were already being processed digitally in the 1990s, and as early as 1985 stop motion was substituted by computer rendering for the first time, in the movie Young Sherlock Holmes and the Pyramid of Fear. The film is remembered for including the first fully computer-generated photorealistic animated character, a knight composed of elements from a stained glass windowE.30.

By the end of the 1990s, x86 based SMP workstations had appeared. Running Windows NT with SMP support, they started competing against large, expensive SGI-like workstations. Hardware with huge efficiency potential meant that things started to happen fast, and PC farms displaced much more expensive Sun rendering systems at a fraction of the cost (Hollywood CGI effects). In 1993 the film Jurassic Park set blockbuster standards when it came to full computer graphics simulations:

Tippett created stop-motion animatics of major scenes, but, despite go motion's attempts at motion blurs, Spielberg still found the end results unsatisfactory in terms of working in a live-action feature film. Animators Mark Dippe and Steve Williams went ahead in creating a computer-generated walk cycle for the T. rex skeleton and were approved to do more. When Spielberg and Tippett saw an animatic of the T. rex chasing a herd of Gallimimus, Spielberg said, "You're out of a job," to which Tippett replied, "Don't you mean extinct?" Spielberg later wrote both the animatic and the dialogue between him and Tippett into the script, as a conversation between Malcolm and Grant. As George Lucas watched the demonstration alongside them, his eyes began to tear up. "It was like one of those moments in history, like the invention of the light bulb or the first telephone call," he said. "A major gap had been crossed, and things were never going to be the same."E.31

5.4 2000s: Render Farms

And so it happened that SMP workstations became ubiquitous, and so did 100Mbps network cards, the 100Mbps standard having been adopted in 1995. All of a sudden it started to make sense to fill up rooms with ordinary PCs (cheap) running Linux (free), speaking to each other over 100Mbps Ethernet and sharing Herculean tasks by breaking them up into little bits and distributing the workload as evenly as possible between the PCs on the farm. Hollywood studios had been using top-notch, cutting-edge, expensive SGI equipment, and all of a sudden it started to make sense to switch over to much cheaper (and sometimes even higher performance) render farms, typically running openMosix on a Linux kernel to distribute workloads throughout the cluster.

5.5 Beowulf as a Valid Alternative

Nowadays reasonably priced blade servers and multicore CPU based computers can be found. Still, a Beowulf cluster can be built which outperforms a newer multicore CPU based cluster; for example, a 12-node Pentium 4 cluster can outperform a 2-node AMD Turion cluster in many instances. And the 12 Pentium 4 nodes do not even need to be paid for: they are absolutely free, being recycled instead of discarded or sent to third world countries, where most of them would be immediately discarded anyway due to lack of support (a different story, but one well worth telling, as the practice is not only unethical but defies rational thinking). Making Beowulf a household word might prove a little more difficult, but the word is gaining ground at a steady pace.

Chapter 6

The Future

6 The Future

This project deals only with Beowulf cluster construction and architecture using completely free and keenly priced equipment. However, computer clusters come in all shapes, and our project utilises a piece of hardware which is very similar in essence to what is found in modern high power supercomputer clusters such as Mont-Blanc.

6.1 Mont-Blanc

When it comes to raw computing power it is easy to choose x86 based hardware, as it is conveniently powerful out of the box. However, the amount of heat generated by such a system can be huge, as can its power consumption. Those two issues mean that (i) the system will be more expensive to run and (ii) it will be paramount to optimise the system design around cooling needs. Enter high power ARM CPU clusters, e.g. Nvidia Tegra technology. A Tegra CPU comprises 4 ARM CPU cores plus one low-consumption, 500MHz "companion core" for simple and stand-by tasks, teamed up with an Nvidia GeForce GPU. According to Alex Ramirez, leader of the Mont-Blanc project, this allows for up to a tenfold reduction in power consumption (their 2014 goal). GPUs, also used in x86 based supercomputers, are power hungry but needed for serious number crunching. However, in x86 systems the CPUs themselves gobble up 40% of the power, whereas the very efficient 4+1 ARM core architecture in Tegra allows for an outstanding reduction in power consumption.

Now enter the Tegra's little sibling, the SoC used on the Raspberry Pi. This consists of a 600-to-700MHz (overclockable up to 1GHz) ARM CPU together with a Broadcom GPU. The SoC on the Raspberry Pi is Broadcom's BCM2835 (even if /proc/cpuinfo talks about a BCM2708) and the GPU is clocked at 250MHz. This is a VideoCore 4 GPU and is present in most bottom-range Samsung mobile phones, like the Galaxy Mini. It would be extremely interesting to use the Broadcom GPU in the Pi's SoC in the future in order to make a Raspberry Pi cluster more efficient. However, its drivers have not been made available by Broadcom (all is closed source, unfortunately), and Raspbian does not even have an Adobe Flash package with proper hardware acceleration, only the completely unusable Gnash plugin.

6.2 UPC's Scientific, Technical and Educational Training

The reader has been introduced to basic system architecture history along the lines of STEM. One of the reasons is that, likewise, the UPC's ICE held a training seminar from 5 February until 8 March 2013 entitled "Formació en activitats científiques, tècniques i didàctiques. Supercomputació." (Training in scientific, technical and didactic activities. Supercomputing.) to encourage students to take up studies in science and technology, thus also along the lines of STEM.

Chapter 7

The Raspberry Pi

7 The Raspberry Pi

A short chapter is dedicated to this small computer, since it is so relevant to this project and to teaching. The Pi was introduced on 29 February 2012 at US$35 for the Model B (2 USB ports and an Ethernet socket) and US$25 for the Model A (just 1 USB port and no Ethernet port). It comprises an ARM1176JZF-S CPU, 256MB of RAM (up to mid 2012) or 512MB of RAM soldered directly onto the CPU, and a few I/O devices that turn it into a real computer.

Illustration 7: Diagram of a Raspberry Pi.E.18

The Pi can be used as a 'headless' computer or as a fully fledged desktop computer. In order to use it as a desktop computer, just plug a USB keyboard and USB mouse into its USB ports and a monitor into its HDMI port; an output jack socket is supplied for stereo audio. In order to use it as a 'headless' computer, however, just a power cable and an Ethernet cable are needed. (It is also conceivable to use a WiFi USB dongle instead of the Ethernet cable, although in that case the right dongle must be chosen and drivers for it installed.) On this project our Raspberry Pis shall be used as headless computers. Access will be gained by logging into them using ssh client terminals, be it from a Linux PC or from a Windows PC running PuTTY, the free ssh client for Windows. See 13.2.

Illustration 8: Our nicely clad Raspberry Pi.

Raspbian Wheezy shall be installed. This is the Linux distribution of choice for this particular ARMhf architecture, since most of its rough edges have been smoothed off for us, and it comes with a fully searchable package repository among other advantages.

The Raspberry Pi Foundation has also made available an installer image for this distro which automatically sets up the SD card with a 56MB FAT partition (used for the bootloaderE.19) and a 2GB ext4 partition for the OS, which shall then be extended to fill our Toshiba 8GB SD card so that there is always sufficient space. At an extra cost, a clear, see-through case will be purchased for each of the Pis so that they are a little better protected from the outside world.

The built-in Broadcom VideoCore IV GPU is capable of up to 24GFlops. However, it is not at present straightforward to use it for our cluster calculations.E.20 There is only a framebuffer interface for display purposes; there is no OpenCL, no plans for it, and no documentation available with which to create an OpenCL implementation. Once an OpenGL driver becomes available we may be able to engineer some calculations via the GPU, but how useful that will be remains to be seen.E.21

Illustration 9: A Raspberry Pi Software Block Diagram.E.20

Chapter 8

A Heterogeneous Cluster

8 A Heterogeneous Cluster

Our cluster is very much a heterogeneous one, whereas load balancing techniques are well established for homogeneous systems in what is basically symmetric multiprocessing. Parallelisation software is by default told to optimise execution for a number of relatively well-known, or at least similar, machines, so that processing power on the cluster is maximised. However, when a task needs to be broken up into many subtasks and each computing node is completely different, performance issues can easily be encountered. Our Beowulf cluster will help understand (and potentially study) how and why this happens and how beneficial it is to use a particular node. It will also be possible to run tasks on our cluster sending them out only to the particular nodes desired.

8.1 Symmetric Load Balancing

This means equivalent tasks will be sent to each node on the cluster. There are different 'standard' approaches to balancing loads on a heterogeneous cluster.

8.1.1 Finest Grain: the Coimbra approach

At the University of Coimbra, Portugal, a project was undertaken where compute nodes (in actual fact workstations) were 'donated' dynamically. Since the computational power of each node is not known from the start, the problem is reduced to the finest grain and no worker is ever expected to execute more than one simple task at a time. This means the system will suffer from any communication latency there may be in it. Therefore a 'minimum' task should be defined such that it is optimally executed by the least powerful node.

8.1.2 Benchmarking nodes: the HINT benchmark

Researchers at Brigham Young University use compute nodes donated from within the university. They execute the HINT benchmark on each node and store the results, which are then used to fine tune the application and the parallelising software. This takes us to static load balancing. Its disadvantages are that it is computationally expensive to prepare, does not scale well, and does not port easily either.

8.2 Asymmetric Load Balancing

This is a project in itself. There are many factors to be considered, namely the features of each node:

– Memory

– Rough processing power: BogoMIPS figures.

– Integer vs floating point processing power

– Efficiency of memory allocation

– Speed of communications

– And so forth

8.3 Quibbles

The performance of the cluster with a slow node may be worse than the performance achieved without using that node at all. Adjusted load balancing can help us avoid that pitfall, although the process is seldom straightforward.

8.4 Potential for Further Research

Modern devices like newer smartphones are endowed with ARM's big.LITTLE technology. big.LITTLE heterogeneous technology means that on a multi-core SoC there are fast cores and slow cores. The slow ones are used by preference when there are no complicated tasks to perform, so that the phone saves battery by not using the faster cores at all at given moments.

Samsung Galaxy S IV phones use a particular setup. Their Exynos 5 Octa SoC consists of 8 ARM cores: four 1.6GHz fast cores and four 1.2GHz slow(er) cores (apart from a tri-core 480MHz GPU). Even if the system must have been designed for ease of balance, this is very much a system on which asymmetric balancing must be carefully sorted out. Samsung did not ship Galaxy S IVs with the Exynos 5 Octa chip from the start, and the likely reason is that a good deal of asymmetric balancing fine tuning was needed before they could optimise their system and dazzle the audience with possibly the fastest and best smartphone ever.

Interestingly, at the moment the preferred solution is simple cluster switching. This probably simplifies kernel tweaking. At any given time, either the big cluster is active (for maximum performance) or the LITTLE cluster (for maximum power efficiency). The switching of the active cluster is performed on demand by the chip firmware, with the migration of the

running software from outgoing to incoming cores taking 20 µs. The CPU presents itself to the operating system and applications as a single quad-core processor. Officially, Samsung claims that the Exynos 5 Octa can consume up to 70% less power than the Exynos 5 Dual (see E.27).

Chapter 9

HBC Overview

9 HBC Overview

HBC is the generic acronym given to this Hybrid Beowulf Cluster.

9.1 Network Topology

In this project, the network will be a subnet within a larger network, the latter being that of the company or institution the HBC is installed at. In order to run the system devised for this project, our home subnet will be used. Two Ethernet switches, a Conceptronic eight-port switch that was tucked away in a drawer and a new brandless four-port switch purchased on eBay, will network the cluster, and our home residential gateway, a simple D-Link DSL-2740R unit, will allow the user to access the system remotely.

9.1.1 Head / Master Node: RaspberryPi

This machine has the roles attributed to the machine running the cluster. Access will be gained to the Beowulf cluster through this machine when ssh'ing into it from the outside. The sshd server running on it listens on port 80. This machine also runs an email server which reports its status once a day.

9.1.1.1 RaspberryPi99

This machine is a Compute Node. It is named after its static IP address. The residential gateway forwards rdp requests to it through the standard rdp port 3389.

9.1.1.2 Rest of Nodes

x86 hardware running Ubuntu 10.04.4. There are Pentium IIIs and a couple of AMDs. More on all those later.

9.2 Software

9.2.1 Raspbian OS on Master and Compute Nodes

The Linux distro of choice for the Raspberry Pi, Raspbian Wheezy (namely Debian Wheezy on the Raspberry Pi), has a mostly complete, working kernel for the device. This will allow us to match Beowulf master node software with Beowulf compute node software. The master node will be made to communicate with the compute nodes; this will mean making some adjustments so that the libraries in charge of passing messages (MPICH2, an implementation of the MPI message passing standard) speak to each other.

9.2.2 Ubuntu 10.04.4 OS on Compute Nodes

An LTS (Long Term Support), stable, simple and powerful Linux distro for pre-SSE2 x86 hardware (i.e. Pentium III and up). It is ideal for running Beowulf compute node software on PC hardware, and is deemed the 'ideal distro' by Ubuntu themselves (see E.32).

9.2.3 Rest of software needed

– Network File System (NFS) on all nodes.

– Secure Shell (SSH) on all nodes.

– Message Passing Interface (MPI) – MPICH2 on all nodes.

MPICH2 is a high-performance and widely portable implementation of the MPI standard, designed to implement all of MPI-1 and MPI-2 (including dynamic process management, one-sided operations, parallel I/O, and other extensions).

– Users running the distributed tasks: a special user, replicated on each node, will be used.

In the end it has been chosen to abandon the use of PXE network boot where possible, so as to be able to configure each machine separately, our HBC being deeply heterogeneous. Setting up the cluster and each node by hand will be more time consuming but will have several advantages, one of them being the fine tuning of each node.

Chapter 10

Hardware. Building the System

10 Hardware – Building the System

10.1 Picking up the Bits

The company the writer of this dissertation works for was throwing out some old PCs that still worked, most of them devoid of essentials like RAM (or bits of it), network cards and so on. However, they were also throwing out some other PCs which did not work at all but were complete with the essentials that could be installed on the PCs with the missing bits. In particular, the PCs that were being thrown out and did not work at all were HP Vectra PCs with 500MHz Pentium III Katmai processors (max FSB speed 100MHz). They also had working network cards, odd bits of RAM, small IDE hard discs and other interesting bits. The PCs with the missing bits were HP Kayak XM600 workstations complete with 3D OpenGL graphics cards. One of them did not have any memory, another one no network card, another one no hard disc, and so on. Each Kayak XM600 had an existing Pentium III Coppermine processor. 500/100 Katmais cannot be mixed with 600/133 Coppermines; that is why the decision was made to install four Coppermines in two Kayaks and the two Katmais in the third Kayak. The Kayak with two Coppermines and 512MB of RAM will be configured with an xfce4 graphics environment and full terminal functionality so that it can be used to access the system locally, as well as a compute node if so desired. Some equipment used on the project was already at hand and some more was purchased on eBay.

10.2 The Dell Poweredge 1500sc Server

This wonderful computer came with a single 1.1GHz Pentium III processor and only 256MB of RAM. It is equally a shame that its video memory is a measly 4MB. A pair of DIMMs was then purchased, one GB each, so that the computer's memory is now 2.25GB – quite a powerstation.

Illustration 10: The Dell Poweredge on the way to its destination.

A further decision was made to upgrade it with a second processor, but an

identical twin could not be found. Therefore two 1.13GHz Pentium III processors and one more heat sink had to be purchased. Good old Dell server hardware is still fully documented on the Dell support Website; the PDF documents were essential in order to set up the hardware.

First, the 'newer' 1.13GHz CPU was installed where the old 1.1GHz CPU was. The heat sink was removed off the old CPU, some thermal paste was applied on the new CPU, it was carefully dropped in the socket, and the clip was then put on and fastened carefully so as not to exert too much force on the board and socket.

Illustration 11: Inside the Poweredge 1500sc. A well made piece of hardware, and a heavy one too.

The system now needs to be started so that it is made aware that a new CPU with a different clock is available. Then installation of the second CPU can proceed. Before doing so, the additional VRM (Voltage Regulator Module) must be installed in its slot. This is a straightforward procedure, the board and the slot being sturdy enough. In order to install the second CPU, the dummy is removed off the second socket and the second CPU put in it. After that, the additional heat sink purchased is placed on it with its clip.

When all is in place, the Memtest set of tests is run to make sure the second hand memory boards are fine, and some persistent, worrying errors are encountered. These turn out to be due to the fact that the built-in USB 1.1 firmware driver uses some bits of the memory that are actually left out by the system but which conflict when Memtest is running. After that, all is well.

Illustration 12: Updating the Poweredge's BIOS.

10.3 The 3 HP Kayaks

The HP Kayak was a mid-range workstation made for the European market. These units were manufactured in France.E.17

There are 3 HP Kayak XM600 workstations on the cluster. Each came with a single 600MHz Pentium III Coppermine CPU. Some of them had no memory, others only 256MB or 128MB. Those were topped up with odd bits available and donated elsewhere, and bits bought on eBayA. Thankfully, good old HP hardware is still fully documented on HP's support Website; the PDF documents were essential in order to set up the hardware.E.17

Installing a second CPU is simple. The Pentium III Coppermines come on cards, and there are slots on the motherboard. Just remove the dummy and push in the PIII card until it clicks in place. The CPU comes with its own heat sink riveted to the card.

Illustration 13: A Pentium III Katmai processor. Left to right: assembly clip, rear plastic cover, clip, radiator and main processor board with processor and cache.

10.4 The Webgine 1115XL laptop Hardware

This is an old computer we already had in stock, with only 256MB of RAM, which makes it unusable for most everyday graphical desktop tasks, either on Windows XP or Ubuntu, and upgrading it to at least 512MB cost at the time over 35 euros, so it was never done. However, as a compute node this machine is fairly interesting, as it has a Pentium III Coppermine CPU running at 1.1GHz. Ubuntu 10.04.4 was installed on it with no further problems.

10.5 The HP Pavilion AMD64 based laptop

This is an excellent dual core 64-bit AMD Turion laptop with 1GB of RAM and 1.6GHz CPU cores, of course with 3DNow!. This will turn out to be the very best machine on our cluster, and it is the kind of hardware widely used for high performance, low power, low footprint Beowulf clusters. The machine is virtually unusable because its video card is broken, apparently a common fault on this hardware: it becomes loose from the main board and ceases functioning. However, by physically manipulating the computer, we managed to start the video card just once, so access to the BIOS could be gained and the settings changed so that it would boot off a USB drive.

Thus this computer now happily boots off a SanDisk Cruzer 16GB USB drive. Installing the software was straightforward: another computer was booted off a Ubuntu 10.04.4 installation CD and then instructed to install the system and GRUB (the boot program) on the USB stick rather than anywhere else. A special package for SSD optimisation was installed, which ensures no speed loss derives from the fact that the whole system runs off the USB stick. At the moment, for compatibility purposes, a 32-bit Ubuntu 10.04.4 system is running on this computer. It is however conceivable to further fine tune the system by installing a fully fledged 64-bit Ubuntu 10.04.4.

10.6 The AMD Athlon based brandless desktop

This computer came without a hard drive or a CD drive. However, its AMD processor already had a special fan, albeit one whose speed is manually controlled through a potentiometer-endowed PCI card. A surplus 8GB IDE HD was then fitted, which is small in capacity but reasonably fast. In order to install Ubuntu 10.04.4, an old, slow CD drive which would not fit in the box was installed. The graphics card refuses to display anything much when in graphics mode (dirty, 256-colour-only images) but we will only ssh into it.

Illustration 14: The AMD Athlon brandless desktop. Fan assembly at the top left, under the Conceptronic Ethernet switch.

10.7 The Raspberry Pi

Hardware-wise this is the simplest piece of kit: just get the SD card and stick it in. Before doing so, though, Raspbian was downloaded from the Raspberry Pi Foundation Website and written to the card. The card was then inserted in a Ubuntu powered desktop and GParted was used to enlarge the main (ext4) partition so that it would fill the whole 8GB available on the SD card. We then started it on the Pi.
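For reference, writing the downloaded image to the SD card is typically done with dd. This is a sketch in which both the image file name and /dev/sdX are placeholders: the actual device name must be double-checked (for instance with lsblk) beforehand, since dd will overwrite whatever disk it is pointed at.

```shell
# Write the Raspbian image to the SD card (placeholders, check your device!).
sudo dd if=raspbian-wheezy.img of=/dev/sdX bs=4M
sync   # flush buffers before removing the card
```

After this, the ext4 partition can be enlarged with GParted as described above.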

At first the older image was used, the one without the special, updated firmware that allows Raspbian to detect our newer Pis' 512MB of RAM. This was found out by having a look at cat /proc/meminfo, and so the newer version was found, downloaded and installed: the newer firmware had to be obtained for the additional RAM to be seen and made usable. Running rpi-update would update the firmware to handle 512MB boards. We had to manually download rpi-update and then install it as described at https://github.com/Hexxeh/rpi-update.

Illustration 15: Raspberry Pi's raspi-config helper.

However, as of now there is an excellent NOOBS package (New Out Of the Box Software) which does everything automatically: http://www.raspberrypi.org/archives/4100

For a detailed list of hardware and its origins see the chart in Appendix A.

10.8 Communications

Only simple 10/100Mbit Ethernet switches will be used: one 8-port switch, one 5-port switch, and one 4-port ADSL residential modem/router/switch.

10.9 Physical Layout: Positioning the PCs

It has been chosen to stack the PCs up so that they do not take up too much space, while also making sure there is enough ventilation. This is likely to be a nice heater in the winter, but it could potentially turn into a blazing furnace in the summer.

The photos of the HBC are on the next page.

Illustration 16: The two Raspberry Pi's on HBC and the residential gateway. An external HD for backup purposes.

Illustration 17: The x86 machines on HBC.

Chapter 11

Network – Building the System

11 Network – Building the System

11.1 The Downton Cluster

Our two little daughters were reasonably intrigued by the pile-up of hardware in the study, and they thus became involved in the project. After a little chat it was decided that they would name the x86 compute nodes after characters from a popular ITV series this family likes very much: Downton Abbey.E.43 Some of the names were chosen rather at random; however, the oldest computer (the Pentium II Klamath Vectra VL) was named Cora and the tall, wide and black computer (the Dell Poweredge 1500sc) was named MrCarson.E.43

11.2 IP addresses and node names

Those names shall be used to identify the computers on the network by creating an /etc/hosts file that looks like this:

127.0.0.1      localhost
192.168.0.9    presario-desktop  #computer on another subnet
192.168.0.101  raspberrypi       #master node
192.168.0.99   raspberrypi99     #pi compute node
192.168.0.21   cora              #HP Vectra VL PentiumII Klamath
192.168.0.201  daisy             #brandless AMD Athlon desktop
192.168.0.202  anna              #Gericom Webgine PentiumIII Cop.
192.168.0.203  mrshughes         #HP Pavilion dual core AMD Turion
192.168.0.204  mary              #Kayak 512MB 2x PentiumIII Cop.
192.168.0.205  mrcarson          #Dell Poweredge 2x PentiumIII C.
192.168.0.206  edith             #Kayak 384MB 2x PentiumIII Kat.
192.168.0.207  sybil             #Kayak 384MB 2x PentiumIII Cop.

Any references to IPv6 shall be commented out. On every node /etc/hostname was changed to the computer's given name.

Chapter 12

Software – Building the System

12 Software – Building the System

12.1 Previous Considerations

For worldly purposes, Debian Wheezy 7.0 (aka Raspbian) has been chosen for the Raspberry Pis and Ubuntu 10.04.4 for the x86 based PCs. Those two distributions are the ones that are most suited and best supported (in the shape of available packages, up to date repositories, hardware support, etc.; see 9.2.2).

It should be noted that at the start of the project it was thought best to use the Intel PXE network boot-up system in order to automatically configure the whole system (we thought we might have to leave the Pis out) without having to work on each particular node. One of the ways to go was Rocks plus PXE. This would not have been problem free and would probably have led to hidden problems, although it would probably have been faster than doing everything by hand. Alas, in practice this could not work given the deeply heterogeneous nature of our system. Get root, as the saying goes, get your CD drives and Ubuntu 10.04.4 CD installers, get your hands on new 8GB SD cards for the Pis, and do everything by hand, albeit in an organised, pre-optimised way.

12.2 General Procedure on PCs

The normal Ubuntu 10.04.4 installation CD was used. On systems with less than roughly 280MB of RAM (like Anna) the Ubuntu 10.04.4 alternate installation CD was used. This operates with no graphical environment, thus keeping the memory footprint during installation to a minimum and much speeding things up.

12.2.1 Partitioning

Ubuntu's original suggestions shall be followed – just let it do what it wants to the target hard drive. After that, it must be instructed to install GRUB on the drive.

12.2.2 Reverting to Old GRUB

This is not completely necessary but can prove useful, as the new GRUB shipped with Ubuntu 10.04.4 boots directly into graphics mode, i.e. with the gdm daemon running. By reverting to the old GRUB, /boot/grub/menu.lst can easily be edited with a simple text editor like nano. The main boot line can then be edited to remove 'splash' and add 'text' at its end. This way no graphics environment will be started.

Depending on the hardware it can be of use to add 'noapic' or 'nolapic' to the main boot line (especially on the Kayaks) to make sure the system boots in a more stable way and PCI cards work well. This disables Intel's APICs, i.e. the Advanced Programmable [Local] Interrupt Controllers, which are buggy on these boards and are supposedly meant for symmetric multiprocessor systems. Mary will have a fully fledged xfce4 graphics environment, and sticking to GRUB2 can be of some help in starting up the old, Glint powered graphics card's functionality as it wishes to communicate with the monitor. On the other PCs, however, the classic version of GRUB will be used.
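For reference, a menu.lst stanza after the edits just described might look like this. This is a sketch: the kernel version and root UUID are placeholders, not values taken from our actual machines.

```
title  Ubuntu 10.04.4 LTS
root   (hd0,0)
kernel /boot/vmlinuz-2.6.32-generic root=UUID=<placeholder> ro noapic text
initrd /boot/initrd.img-2.6.32-generic
```

Note that 'splash' has been removed, 'text' appended, and 'noapic' added for the Kayaks.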

12.2.3 Intel Microcode

This package was installed on all machines with Intel branded CPUs but not on the two AMD machines, namely MrsHughes (AMD Turion) and Daisy (AMD Athlon). However, in order to run High Performance Linpack benchmarks (more on this further on), libatlas3dnow will be installed on those two to make the most of AMD's built-in 3DNow! hardware, and libatlas-sse on the Intel machines so as to get the most out of their SIMD instruction set.

12.3 Installing MPICH2

MPICH2 is a package of libraries designed for seamless task sharing among the nodes of a cluster. It also needs Hydra, its process manager, for establishing communications; Hydra is more or less a part of MPICH2. Therefore compatible versions of MPICH2 and Hydra will be installed on Ubuntu 10.04.4 and on Raspbian so as not to experience any problems with the message passing protocols. Current Raspbian depends on firmware on the Raspberry Pi and runs a v3 kernel, whereas Ubuntu 10.04.4 runs a v2.6 kernel.

At the moment of writing, extra care should be taken if attempting to set up a Beowulf cluster along the lines of this project, since Raspbian's mpich2 has changed substantially and the current, newer MPICH2 libraries (higher than 1.4.2) are no longer compatible with Ubuntu 10.04.4's ones (1.2.1). Using the following Debian packages, still available from the Debian server, is recommended. In the event those should be difficult to find, we keep a copy which we would be happy to share.

mpich2_1.4.1-4.1_armhf.debE.40
libmpich2-3_1.4.1-4.1_armhf.deb
mpich2-doc_1.4.1-4.1_all.deb
libmpich2-dev_1.4.1-4.1_armhf.deb
mpich2python_2.8-4_armhf.deb

Please note that 'hf' means 'hard floating point' on ARM, i.e. this will ensure the CPU

is using its own floating point unit. These libraries are thus optimised for our Raspberry Pi hardware platform.

On the x86 machines, MPICH2 will be installed from the Ubuntu repositories. This is indeed possible on Ubuntu 10.04.4. The MPICH2 version installed will be 1.2.1, but it is still fully compatible with Raspbian's 1.4.1. Hydra will have to be installed separately, though, by compiling it from source. This is what we will be downloading and compiling: http://www.mpich.org/static/downloads/1.4.1/hydra-1.4.1.tar.gz

The 'partner' repositories on the machines must be enabled. On the x86 nodes the official Ubuntu 10.04.4 packages mpich2-dev and then openmpi-bin are installed, and on top of that, hydra-1.4.1.tar.gz: gunzip and tar, then ./configure and make. When that is finished, the three freshly compiled files

hydra_nameserver
hydra_persist
hydra_pmi_proxy

are copied to /usr/local/bin on Ubuntu 10.04.4. On Raspbian, the files will be in /usr/bin. It can prove useful to place the executables in an NFS share and access those from every node later on, so those are now copied to each node's /usr/bin and /usr/local/bin, as it would not make much sense to recompile Hydra on each x86 node. mpich2version gives us 1.4.1 plus a wealth of additional info. For reference purposes, the source code of mpich2-1.4.1 is here: http://www.mpich.org/static/downloads/1.4.1/mpich2-1.4.1.tar.gz
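The Hydra build just described can be condensed into a short shell session. This is a sketch: whether sudo is needed, and whether the destination is /usr/local/bin or /usr/bin, depends on the node as explained above.

```shell
# Download, unpack and build Hydra 1.4.1 on an x86 node.
wget http://www.mpich.org/static/downloads/1.4.1/hydra-1.4.1.tar.gz
tar xzf hydra-1.4.1.tar.gz
cd hydra-1.4.1
./configure
make
# Copy the three resulting launchers where the cluster expects them.
sudo cp hydra_nameserver hydra_persist hydra_pmi_proxy /usr/local/bin/
```

The three binaries can then be copied from here to the remaining x86 nodes rather than rebuilt on each.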

12.3.1 Creating specific user hbcuser

All MPI users must have the same username and user ID. This is because the MPI user must later be given access to the NFS directory, and permissions on NFS directories are checked against user IDs. The MPI user shall be called hbcuser, with ID 999 and password beowulf. This user will be created on each node. Note that by giving it an ID below 1000, the user will not appear on the Ubuntu login screen.
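Creating the user as described comes down to a pair of commands, to be run as root on every node. This is a sketch; the exact useradd flags vary slightly between distributions.

```shell
# Create the cluster user with a fixed UID so NFS permissions line up.
sudo useradd -m -u 999 -s /bin/bash hbcuser
sudo passwd hbcuser    # set the password interactively (beowulf)
```

Fixing the UID at 999 on every node is what makes the NFS permission checks described above come out right.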

12.3.2 Installing NFS on master and compute nodes

nfs-kernel-server is installed on the master node, and we make sure it is up and running by checking with service nfs-kernel-server status. NFSv4 support also needs to be installed on each of the compute nodes, both x86 and Pis; however, this is the package nfs-common, as there is no need for a server there.

A directory on the master node that will be exported to all nodes must then be created. We shall call it /hbc. It needs to be owned by hbcuser, so chown hbcuser:hbcuser /hbc. The directory must then be 'exported' via NFS to the network. In order to do so, the Debian instructions are followed and the following line is added to the file /etc/exports:

/export/hbc 192.168.0.0/24(rw,nohide,insecure,no-subtree-check,async)

so that the directory /hbc will be exported, i.e. made available to the rest of the network for mounting. After that the NFS server needs to be restarted. On each node an /hbc directory has to be created and chown'd hbcuser:hbcuser, and each node will mount the directory like this at startup:

mount -t nfs4 raspberrypi:/hbc /hbc
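The master-side setup can be sketched in a few commands, to be run as root on the master node. This is a sketch; it assumes the Debian NFSv4 layout in which the exported /hbc is bind-mounted under the pseudo-root /export, which is why the exports line above reads /export/hbc.

```shell
# Master node: install the server and export /hbc over NFSv4.
apt-get install nfs-kernel-server
mkdir -p /hbc /export/hbc
chown hbcuser:hbcuser /hbc
mount --bind /hbc /export/hbc
echo '/export/hbc 192.168.0.0/24(rw,nohide,insecure,no-subtree-check,async)' >> /etc/exports
service nfs-kernel-server restart
```

For the bind mount to survive reboots, a matching line would also go into /etc/fstab on the master node.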

12.3.3 Ensuring passwordless communication between nodes

An RSA key now needs to be generated on all nodes, having logged in as hbcuser. The key shall be passwordless; this will ensure each node can be ssh'd into with no need for a password. The command ssh-keygen -t rsa is used, leaving the passphrase empty. Now every node has an SSH key. The master node needs to be able to log into the compute nodes, so a copy of the master node's SSH key will be placed on each node. By running the command ssh-copy-id [node name/IP] as user hbcuser for mary, edith, sybil and so on in turn, the master node shall be able to log into each compute node seamlessly.
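On the master node, the key generation and distribution just described can be sketched as a short loop, run as hbcuser. The node list is the one from our /etc/hosts; ssh-copy-id will ask for each node's password once, after which logins are passwordless.

```shell
# Generate a passwordless key once, then push it to every compute node.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for node in raspberrypi99 mary edith sybil mrcarson mrshughes daisy anna cora; do
    ssh-copy-id "hbcuser@$node"
done
```

A quick ssh hbcuser@mary afterwards should log in without prompting.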

12.4 Setting up for hybrid operation

This is vital for the HBC to run, as executables will differ between the ARM and x86 machines. Programs will therefore have to be compiled separately for ARM and x86: x86 executables will reside on the x86 nodes and ARM executables on the ARM nodes. The directory shared across the cluster, /hbc, will contain the programs and data to be shared across the network, but each node must be able to fetch its own architecture-specific executable rather than an executable placed in the directory /hbc. For that purpose, instead of calling an executable on the mpiexec command line

which would be shared across the cluster, a short script will be invoked, and in it the real executable on each node will be pointed to. A directory called /hbcrun is created on each node and chown'd hbcuser:hbcuser; the architecture-specific executables will reside in this directory. Thus, as pointed out, when the program to be fed to all nodes is launched on the master node, only an sh script is invoked, pointing to the 'real' executable residing in the directory /hbcrun on each node (see below).
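A wrapper of the kind just described can be generated with a small helper. This is a sketch: make_wrapper is a hypothetical function name, and when trying it out any scratch directory can stand in for the shared /hbc.

```shell
# Create an /hbc-style wrapper script that defers to the node-local,
# architecture-specific binary in /hbcrun.
make_wrapper() {
    prog=$1; shared=$2
    cat > "$shared/$prog.sh" <<EOF
#!/bin/sh
exec /hbcrun/$prog "\$@"
EOF
    chmod +x "$shared/$prog.sh"
}

# Example (on the real cluster): make_wrapper icpi-info /hbc
```

Every node then runs the same /hbc/icpi-info.sh, yet each executes its own /hbcrun/icpi-info binary.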

12.5 Node-specific executables on each node

The main command line for starting an MPI cluster program is mpiexec. It takes two main arguments: the machine file and the program to be executed.

mpiexec -f machinefile -n x ./example.sh

where example.sh simply invokes /hbcrun/./example

12.6 A few quibbles

Indeed. One of them was inadvertently updating mpich2-1.4.1 to 1.4.5 while updating other packages on the Raspberrypi master node (after running apt-get update). Unfortunately, 1.4.5 is not compatible with 1.4.1: apart from using RealOTE (which got installed on the master node but not on the rest of the nodes), it has different command line parameters and seems to communicate differently with compute nodes and read programs differently (it would refuse to execute a script as 'the' cluster program, which we absolutely need in order to make the HBC work). As a result, nothing worked any more. Backtracking was direly needed so as to reinstall mpich2-1.4.1 and hydra-1.4.1 manually, and in order to do that it was finally decided to copy the .deb files involved to save time. It does prove interesting to save all those apparently useless .deb files. They were in RaspberryPi99's archive (note the hf for hard floating point):

mpich2_1.4.1-4.1_armhf.deb
libmpich2-3_1.4.1-4.1_armhf.deb
mpich2-doc_1.4.1-4.1_all.deb
libmpich2-dev_1.4.1-4.1_armhf.deb
mpich2python_2.8-4_armhf.deb

Chapter 13

Accessing the System

13 Accessing the System

13.1 Locally

Mary (the HP Kayak at the bottom right in the photo) will be the system's terminal. It is reasonably powerful and has a fully fledged xfce4 graphical environment. It will also belong to the Beowulf cluster, working as a compute node at the user's discretion. Therefore, Mary will have its own PS/2 keyboard, mouse, and analogue monitor.

13.2 Remotely

If the user is not located where the HBC is but somewhere else, in possession of a terminal connected to the Internet, they will still be fully able to run the HBC. The master node Raspberrypi runs a no-ip2 updater so that it can be reached remotely. Thanks to an xrdp server running on it, a user can also control the system remotely by means of a Windows based computer (either the Remote Desktop client or the PuTTY ssh client), a BlackBerry smartphone (an inexpensive RDP client), or an Android smartphone (the standard, free ConnectBot ssh client). For security and balancing purposes:

– RaspberryPi will run the xrdp server on port 3389 (the standard port number),

– and RaspberryPi99 will run the sshd server on port 80. Port 80 was chosen so that the system can be accessed from the outside even if some ports are blocked at the location the user is at. Thus the system can be accessed and run as follows:

– A BlackBerry runs the RDP client, which allows us to log into our RaspberryPi running the xrdp server,

– An Android phone runs the ssh client ConnectBot, which acts just as a client Linux box would,

– A Unix or Windows box can also connect with (x)rdp or ssh clients.

13.3 Setting up each node (ARM, x86)

It is assumed that the master node is already active on the network before the compute nodes are started. In some instances the OS does not load the nfs kernel module, which is needed for NFS functionality (i.e. mounting NFS shares). In order to make sure the nfs client module is

loaded, and the shared directory is mounted, the following two lines need to be added to the /etc/rc.local script on every node:

modprobe nfs
mount -t nfs4 raspberrypi:/hbc /hbc

13.4 Worldly Considerations – On / Off

In order to start the nodes of the system, first the multiple power sockets with switches are powered on, then each node. Each node's BIOS needs to be configured so that the POST (Power On Self Test) lets it boot the OS even if it cannot find a keyboard, mouse, monitor, etc.

Turning the nodes off, however, cannot be done just by pushing the off switch, as they actually need to be shut down. This could be achieved by making sure ACPI is installed on each node and that the power options are such that pressing the power switch starts the shutdown procedure rather than switching off instantly, but this would still involve pressing each node's power switch manually and individually. Instead, permissions are changed for the 'poweroff' and 'reboot' commands on each compute node:

sudo chmod a+s /sbin/poweroff && sudo chmod a+s /sbin/reboot

so that by issuing a script command on the master node, all compute nodes shut down. The script will be called /hbc/hbc_poweroff.sh, and we will also write similar /hbc/hbc_reboot.sh and /hbc/hbc_hosts_alive.sh scripts. See C.3.
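A script along these lines could look as follows. This is a sketch, not the actual script listed in Appendix C.3: the node list comes from our /etc/hosts, and the ssh line is left commented out here so the loop can be tried safely.

```shell
# Sketch of /hbc/hbc_poweroff.sh: shut down every compute node over ssh.
NODES="raspberrypi99 mary edith sybil mrcarson mrshughes daisy anna cora"

hbc_poweroff() {
    for node in $NODES; do
        echo "powering off $node"
        # ssh "hbcuser@$node" /sbin/poweroff   # enable on the real cluster
    done
}

hbc_poweroff
```

The setuid bit set on /sbin/poweroff above is what lets hbcuser run it on each node without a password prompt.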

Chapter 14

A Test Run

14 A Test Run

14.1 Choosing an Example

The mpich2-1.4.1 package, which can be downloaded from http://www.mpich.org/static/downloads/1.4.1/mpich2-1.4.1.tar.gz, supplies some example programs written in C in its mpich2-1.4.1/examples directory. Those programs need to be compiled so that they run both on our ARM based nodes and on our x86 based nodes. For this purpose an MPI compiler, mpicc, is also provided with the mpich2 package. On the ARM machine the following command is issued:
mpicc -o icpi-info.arm icpi-info.c
Then change permissions:
chown hbcuser:hbcuser icpi-info.arm
and place the executable in /hbcrun, naming it plainly: icpi-info. Now copy it onto the second ARM node too (RaspberryPi99), for example through the /hbc shared directory. Next the same process is followed on an x86 node. First issue the command:
mpicc -o icpi-info.x86 icpi-info.c
Then change permissions:
chown hbcuser:hbcuser icpi-info.x86
and place the executable in /hbcrun, again naming it plainly: icpi-info. On the master node, RaspberryPi, we need to create a simple script called /hbc/icpi-info.sh containing:
/hbcrun/./icpi-info
Then:
chown hbcuser:hbcuser icpi-info.sh
chmod +x icpi-info.sh
Other examples like 'pmandel' can be compiled too. In that case the mpicc compiler needs to be prompted to use the math library. This will work on both ARM and x86:
mpicc -o pmandel.arm -lm pmandel.c
mpicc -o pmandel.x86 -lm pmandel.c

14.2 Running the Example

The text file /hbc/machinefile contains the nodes we wish to run the programs on, in the right order. Our machine file will look like this:
#raspberrypi
#raspberrypi99
#mrcarson:2
#mrshughes:2
#mary:2
#edith:2
#sybil:2
#daisy
#anna
#cora

We will edit the file manually with nano. Nodes with two cores need the colon and the number of cores immediately after their name. In order to use a particular node, it must be un-commented. In order to use just a particular set of nodes, they need to be reorganised: mpiexec reads from the top, so the nodes we wish to use need to come first in the list. The rest can just stay there, commented out. On HBC there are 15 'compute nodes' in total, ie the exact number of CPU cores available on our system. MPI2 will break up tasks into subtasks and each of those will be sent to one of the CPUs. Now we will run a simple program on all those nodes:
mpiexec -f machinefile -n 15 ./icpi-info.sh
The output on the terminal is:

Process 0 of 15 is on raspberrypi
Enter the number of intervals: (0 quits)
Process 2 of 15 is on cora
Process 5 of 15 is on mrshughes
Process 1 of 15 is on raspberrypi99
Process 13 of 15 is on daisy
Process 3 of 15 is on mrcarson
Process 11 of 15 is on sybil
Process 7 of 15 is on mary
Process 9 of 15 is on edith
Process 6 of 15 is on mrshughes
Process 4 of 15 is on mrcarson
Process 12 of 15 is on sybil
Process 8 of 15 is on mary
Process 10 of 15 is on edith
Process 14 of 15 is on anna
400000000
pi is approximately 3.1415926535897993, Error is 0.0000000000000062
wall clock time = 12.450598
Enter the number of intervals: (0 quits)
0
hbcuser@raspberrypi /hbc $

At this point we are not quite sure whether the pi calculation is entirely right yet, but what has indeed been shown is that one process has been running on every processor of every machine listed in the machinefile. As mentioned before, the user can also choose to run the program on fewer nodes, say only on the two RaspberryPi's:
mpiexec -f machinefile -n 2 ./icpi-info.sh
for an output of:
Process 0 of 2 is on raspberrypi
Enter the number of intervals: (0 quits)
Process 1 of 2 is on raspberrypi99
400000000
pi is approximately 3.1415926535895258, Error is 0.0000000000002673
wall clock time = 77.663177
Enter the number of intervals: (0 quits)
0
hbcuser@raspberrypi /hbc $
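The count passed to -n should match the number of slots the machinefile exposes. A small sketch that counts them (the awk one-liner is our own helper, not part of the MPICH tooling):

```shell
# Create a sample machinefile mirroring the one above, all nodes un-commented:
cat > machinefile <<'EOF'
raspberrypi
raspberrypi99
mrcarson:2
mrshughes:2
mary:2
edith:2
sybil:2
daisy
anna
cora
EOF

# Count slots: an un-commented "host:n" line contributes n slots, a bare host 1.
slots=$(grep -v '^#' machinefile |
        awk -F: '{ total += ($2 == "" ? 1 : $2) } END { print total + 0 }')
echo "$slots slots available"   # prints: 15 slots available
```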

Chapter 15

Feeding Mammoth Tasks to the System

15 Feeding Mammoth Tasks to the System

15.1 Previous considerations

Many programs are available for compilation that calculate molecule structures and other data. Those could be compiled on our system to see how fast it solves the problems. However, it seemed more interesting to start by compiling a benchmarking tool. Such a tool can either be simply 'thrown at' the system or fine tuned so that the best results are obtained, which is exactly what would have to be done when feeding real-world tasks to the system. In this project we have done both and will be comparing the results in every case and circumstance.

15.2 Unoptimised Linpack benchmarking tool

It should be noted that unoptimised ATLAS libraries versus their hardware-specific optimised versions make a big difference in benchmark results. Regarding the Raspberry Pi, un-optimised benchmark figures are sometimes quoted, even more so when Linpack is run on mobile phones; those sometimes have equivalent, often superior, Mflop potential compared to the Pi. The Samsung Galaxy Mini is a good example of the former, as it uses a SoC which is nearly identical to the Pi's. The unoptimised Linpack benchmark referred to above can be downloaded from the netlib.org WebsiteE.36. In order to compile it, the following command should be issued:
cc -O -o linpack linpack.c -lm
When we compile and execute the unoptimised version on the Raspberry Pi we get lower results, roughly a third of what we get using HPL, which is ARM-HF optimised (see paragraph below). The same applies to generic Atlas libraries for x86 processors as opposed to the sse-specific and 3dnow-specific (ie optimised) libraries for Intel and AMD processors, respectively (see 15.3.4). We compile linpack specifically on Cora and Mary. We get 50Mflops on Cora and 180Mflops on Mary, as opposed to 55MFlops with local xhpl and no optimisation, but 755MFlops with MPI-parallelised processes and sse optimisation. We only get 52MFlops on the Pi, close to Cora's Klamath PentiumII. Those figures are interesting when pitched against the rest of the HPL benchmarks on the charts.

15.3 HPL (High-Performance Linpack) Benchmark

The Linpack HPL package was chosen to be run on the Beowulf system to see how many G/MFLOPs are obtained. Linpack is a very useful benchmarking package, with a few options when it comes to compiling it. Its main disadvantage is that its benchmarks tend to overestimate the performance that real-world scientific applications can expect to achieve on a given system. This is because the LINPACK codes are "dense matrix" calculations, which benefit greatly from data locality. It is not uncommon for the Linpack benchmarks to achieve 30 percent or more of the theoretical peak performance of a system. Real scientific application programs, in contrast, seldom achieve more than 10 percent of the peak figure on modern distributed-memory parallel systems such as Beowulf systems. Linpack needs the ATLAS algebra libraries installed so that we can compile against them. First the generic, then the optimised, versions will be used, to see what happens.

15.3.1 Installing, compiling HPL on Pi and on all x86 machines

We install on both ARM and x86:
apt-get install gfortran libatlas-base-dev
The ATLAS libraries that were downloaded and installed on the Pi are:
libatlas-base-dev_3.8.4-9_armhf.deb
libatlas3-base_3.8.4-9_armhf.deb
libatlas-dev_3.8.4-9_all.deb
These libraries are optimised for the hardware floating point implementation on the Raspberry Pi's ARM chip, and that is why we extract about ten times more performance out of the Pi with these libraries as compared to running unoptimised Linpack benchmarksE.35. Should those packages no longer be available, we will be delighted to supply the copy we keep on file. The libraries for x86 are the generic non-SIMD versions for generic x86 processors. The Linpack HPL source code can be found here:
http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
Extract the tar file and create a makefile based on the given template:
sh setup/make_generic
cp setup/Make.UNKNOWN Make.rpi

then edit Make.rpi with the following parameters:
ARCH = rpi
TOPdir = $(HOME)/hpl-2.1
MPlib = -lmpi
LAdir = /usr/lib/atlas-base/
LAlib = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a
Exactly the same steps will be followed on x86, replacing 'rpi' with 'x86' and using different parameters:
MPlib = -lmpi
LAdir = /usr/lib/atlas
LAlib = -lblas
Compile linpack. The xhpl binary will be placed in bin/rpi/ on RaspberryPi and bin/x86/ on x86 (using Mary):
make arch=rpi
make arch=x86
Then we need to create an appropriate HPL input file, bin/rpi/HPL.dat (see C.4), populated with critical input values for the Linpack software.
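For reference, the first lines of a stock HPL.dat look like the fragment below. The values shown are generic template defaults, not the tuned values of this project's C.4 file; N (Ns), NB (NBs) and the P x Q process grid in particular must be adapted to the cluster as discussed in 15.3.3:

```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
5000         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
...
```

The remaining lines, truncated here, tune the algorithm's panel factorisation, broadcast and swapping strategies.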

15.3.2 Compiling on Cora

This machine is running the older Ubuntu Dapper 6.06. Therefore hpl should be compiled specifically on it, and the executable will be different from all other x86 executables present, optimised or unoptimised. In order to compile it, the procedure outlined in 15.3.1 should be followed. The resulting executable will be placed in /hbcrun and called xhpl just like the rest, so that the Pi's xhpl.sh script can invoke it when running mpiexec. In practice it would not be very wise to use this unit, as it is the slowest on the cluster and would not help improve performance. At the time of writing, some dependencies were unmet when trying to compile HPL on Cora, even though the atlas and mpich libraries had been installed on it. (By leaving MPlib blank instead of -lmpi, an executable is obtained, but it does nothing when run.) More work might be needed to solve the current problem: since Cora is running Dapper, an old version of Ubuntu, it may need different compile-time options and/or the installation of other packages, and less support is available.

15.3.3 Know thy cluster

It is vital to know not only what architectures we are using but also what memory sizes, network speeds, etc, each of our compute nodes is endowed with. To start with, the problem size is limited by the amount of memory on the machine with the least physical RAM. This is important, because performance improves as the problem size increases. The goal is to try to run with as large an N as time and hardware permit. To determine the maximum problem size we use a calculation suggested in E.41:
N = 0.8 * sqrt(M/8)
where N is the problem size and M is the total amount of available memory in bytes (calculated by multiplying the bytes of memory in the smallest machine by the number of machines). Each matrix entry is a double precision number, ie eight bytes, hence the division by 8; the factor 0.8 assumes that 80% of the total available memory may be used by HPL, while 20% is left for the operating system and other user processes. For the HPL.dat file format and parameters see C.4. A test run on a single RaspberryPi node without using the MPI system prints about 270Mflops. However, using the MPI system it prints 170Mflops. Running it on two RaspberryPi's, through the MPI system of course, gives us 230MFlops. A test run on Mary, the 2x 600MHz PentiumIII Coppermine machine, gives us about 50MFlops without going through the MPI system. In this case only one process is being specified, which the OS sends out to both CPUs, and it is the OS which does the load balancing, clearly not as efficient as giving out two separate tasks, one to each CPU. In the latter case the figure goes up to around 65MFlops, although it needs to go through the MPI system. In the event we decide to run xhpl on all 15 nodes including Cora, we need to be especially careful not to force Cora to use its swap partition.
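As a worked sketch of the formula, suppose the smallest node had 256 MB of RAM and 15 nodes took part (illustrative figures only; the real values come from this project's hardware charts):

```shell
# Compute the HPL problem size N = 0.8 * sqrt(M/8) for hypothetical figures:
# smallest node 256 MB, 15 participating nodes (NOT the thesis's measured values).
M=$((256 * 1024 * 1024 * 15))          # total memory in bytes
N=$(awk -v m="$M" 'BEGIN { printf "%d", 0.8 * sqrt(m / 8) }')
echo "Problem size N = $N"             # prints: Problem size N = 17947
```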

15.3.4 Atlas libraries and floating point capabilities

The standard libatlas-base libraries have been used in order to compile xhpl. This gives us an idea of the processing power of our cluster, but without taking into account details like each CPU's numeric processing power. There are ways to improve performance on a CPU, namely using its specific number crunching capabilities:
– AMD CPUs have been endowed with the 3dnow set of instructions since 1998. Those were introduced to improve the performance of the CPU when carrying out graphics-intensive tasks.E.37
– PentiumIII CPUs are SSE enabled. SSE, or Streaming SIMD [Single Instruction Multiple Data] Extensions, is a set of special integer and floating point instructions designed to improve the CPU's numeric performance for graphics and other purposes. Those were introduced in 1999, after AMD had already brought out their 3dnow.E.39
Thus three versions in total will be compiled (see 15.3.6, 15.3.7):

– the ARM version, for execution on the Raspberry Pi's and placed in the /hbcrun directory on the Pi's, will be based on the standard Atlas packages for armhf without further hand-tuning;
– on the AMD based computers libatlas-3dnow will be installed and xhpl compiled accordingly. This executable will be placed in /hbcrun on Daisy and MrsHughes;
– on the PentiumIII based computers libatlas-sse will be installed and xhpl compiled. This executable will be placed in /hbcrun on MrCarson, Mary, Sybil, Edith and Anna;
– it should be noted that it is also possible to further optimise the Atlas libraries for the ARMHF (Hard Floating Point) unit on the Raspberry Pi. This can require some extra work. For the moment we will be using the standard, hopefully pre-optimised, libraries.
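Before installing a given Atlas variant it is worth checking which SIMD extensions a node's CPU actually advertises. A quick sketch (on Linux; on the Pi the relevant /proc/cpuinfo line is labelled 'Features' rather than 'flags'):

```shell
# List the SIMD-related capability flags this CPU advertises.
# x86 kernels expose them on the 'flags' line, ARM kernels on 'Features'.
grep -m1 -E '^(flags|Features)' /proc/cpuinfo |
    tr ' ' '\n' | grep -E '^(sse|3dnow|mmx|neon|vfp)' | sort -u
```

On an SSE machine like Mary this lists mmx and sse entries; on an AMD K6-2/K7 era node it also lists 3dnow, which tells us directly which libatlas package to pick.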

15.3.5 Setbacks

Unfortunately, on Cora, the PentiumII based computer, only MMX instructions (introduced with the Pentium MMX) are available. For a start, it will not be able to run any SSE optimised code. Additionally, xhpl will only run on x86 if the libatlas3gf and libgfortran libraries are present. Those are not present on Cora. They were copied over from Mary (libatlas* to /usr/lib/atlas and libgfortran* to /usr/lib), but they ultimately depend on glibc 2.4 to run, and that glibc is not installed on Cora (Ubuntu 6.06 Dapper). Installing libstdc++2.10-glibc2.2 was then given a try to see if it helped. However, the system then informs us it wants libmpich1.2 whereas Cora has libmpich1.0. Substantial changes to Cora's libraries would be needed in order to run xhpl. ARMv7 (Cortex-A8 et al) CPUs have the NEON instruction set, a SIMD instruction set equivalent to Intel's SSE and AMD's 3dnow. However, the RaspberryPi's ARM CPU is a much cheaper ARMv6 part which is not NEON enabled. Thus there are no packages in the Raspbian distribution that allow compiling programs against NEON optimised versions of libatlas, as can be done with the respective extensions on Intel and AMD based computers.

15.3.6 Compiling xhpl on AMD: 3dnow

apt-get install libatlas-3dnow-dev gfortran
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar xf hpl-2.1.tar.gz
cd hpl-2.1/setup
sh make_generic
cd ..
cp setup/Make.UNKNOWN Make.3dnow
Edit Make.3dnow with the following parameters:
ARCH = 3dnow
TOPdir = $(HOME)/hpl-2.1
MPlib = -lmpi
LAdir = /usr/lib/3dnow/atlas/
LAlib = $(LAdir)/libblas.a
Compile linpack. The xhpl binary will be placed in bin/3dnow/ on MrsHughes:
make arch=3dnow
cd bin/3dnow
Before running xhpl, HPL.dat needs to be adjusted to just 1 process, which will be sent to both of MrsHughes's CPUs at once, the OS determining how to balance the load. This still gives a staggering 2.4Gflops.

15.3.7 Compiling xhpl on PentiumIII, SSE

apt-get install libatlas-sse-dev gfortran
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar xf hpl-2.1.tar.gz
cd hpl-2.1/setup
sh make_generic
cd ..
cp setup/Make.UNKNOWN Make.sse
Edit Make.sse with the following parameters:
ARCH = sse
TOPdir = $(HOME)/hpl-2.1
MPlib = -lmpi
LAdir = /usr/lib/sse/atlas/
LAlib = $(LAdir)/libblas.a
Compile linpack on e.g. Anna. The xhpl binary will be placed in bin/sse/ on Anna:
make arch=sse
cd bin/sse
mv /hbcrun/xhpl /hbcrun/xhpl.x86.generic
cp xhpl /hbcrun
chown hbcuser:hbcuser /hbcrun/xhpl
Before running xhpl, HPL.dat needs to be adjusted to just 1 process, which will be sent to Anna's single CPU. This gives us 615MFlops, which is still quite good when compared to previous results running the generic code (more than ten times slower) and makes much more sense when compared to the Pi's 185MFlops.

15.3.8 Rough results, RaspberryPi as Master node

Quite often, the more nodes we add, the slower the whole system tends to become. The asymmetry between nodes is to blame: the slowest nodes pace the rest.

However, by running the task only on the three Kayaks, processing power is actually multiplied: Mary's non-optimised (see further on) 65MFlops x3 = 195MFlops comes close to the 170Mflops actually obtained when teaming Mary up with Sybil and Edith.

15.3.9 Using Mary as Master node

This meant exporting the /hbc directory from Mary rather than from RaspberryPi, so RaspberryPi's export had to be removed, and so on. No difficulties, but no improvements either, have been observed. Therefore, for simplicity, we will stick to using RaspberryPi as the main node. We will access it either from the outside (no-ip, ssh on port 80) or locally (using Mary as a terminal).

15.4 Pre-conclusions on system balancing

It is crucial to tune executables to each x86 node's capabilities, and it has been shown how this can be done: by utilising each node's SIMD instruction set, the speed on each node has increased roughly tenfold. Care must be taken with N on low memory systems like Anna, since values of around 8000 will make it swap to and from the swap partition, bringing CPU utilisation down from 99% to around 5%. We have found other very interesting benchmarking tools in various forums, and those are described below.

15.5 The NAS Benchmark

There are other benchmarks we may envisage running on the cluster eventually, e.g. the NAS Parallel Benchmark Suite developed at NASA. It helps better assess the performance of a cluster since it comprises the following benchmarks:
EP: An "embarrassingly parallel" calculation, it requires almost no interprocessor communication.
MG: A multigrid calculation, it tests both short- and long-distance communication.
CG: A conjugate gradient calculation, it tests irregular communication.
FT: A three-dimensional fast Fourier transform calculation, it tests massive all-to-all communication.
IS: An integer sort, it involves integer data and irregular communication.
LU: A simulated fluid dynamics application, it uses the LU approach.
SP: A simulated fluid dynamics application, it uses the SP approach.
BT: A simulated fluid dynamics application, it uses the BT approach.

Since the original NPB release, implementations of the NPB using MPI and also OpenMP have been provided by the NASA team. These are available at www.nas.nasa.gov/Software/NPB

15.6 The Ping-Pong Test

This should be fun. One of the most widely used measurements, the ping-pong test measures the latency and bandwidth of the interprocessor communications network. There are a number of tools for testing TCP performance, including netperf and netpipe (see www.netperf.org and www.scl.ameslab.gov/netpipe). Ping-pong tests that are appropriate for application developers measure the performance of the user API; they are typically written in C and rely on our MPICH communications library.

15.7 Real World Tasks

There are a number of packages available for compilation and testing on our MPICH2 system. Here is a short list:
• Molecular dynamics simulation, binaries for MPICH parallelization. GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins and lipids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers. This package contains only the core simulation engine with parallel support using the MPICH (v2) interface. It is suitable for nodes of a processing cluster, or for multiprocessor machines. Package name: gromacs-mpich (in the Raspbian and Ubuntu 10.04.4 repositories). Description: http://packages.debian.org/wheezy/gromacs-mpich
• MIT Photonic-Bands, parallel (mpich) version. The MIT Photonic-Bands package is a free program for computing the band structures (dispersion relations) and electromagnetic modes of periodic dielectric structures, on both serial and parallel computers. It was developed by Steven G. Johnson at MIT in the Joannopoulos Ab Initio Physics group, and designed to study photonic crystal structures. This package contains a parallel version of MPB, using the mpich implementation of the MPI protocol. It allows for calculations on clusters of computers. Package name: mpb-mpi (in both architectures' repositories). Description: http://packages.debian.org/wheezy/mpb-mpi

• Illuminator Distributed Visualization Library: demos. This little library provides contour surface viewing for PETSc's 3-D distributed array (DA) objects using the Geomview viewer, and distributed storage and retrieval of PETSc DAs of any dimensionality in the IlluMulti (optionally compressed) binary format. This package contains the tsview viewer for 2-D and 3-D timestep sequences stored in IlluMulti format. It also contains two demonstration programs: "chts" (Cahn-Hilliard timestep) with its front-end "chui" (Cahn-Hilliard User Interface), and "3dgf" (3-D potential Green's function visualizer). With mpich, you can run these in parallel using e.g. "mpirun -np X /usr/bin/chts" where X is the number of processes (optimally equal to the number of processors), with only process 0 requiring access to your X display for 3-D graphics. Package name: illuminator-demo (in both architectures' repositories). Description: http://packages.debian.org/squeeze/illuminator-demo
Many more are available in the Debian (and as such Ubuntu) repositories.

15.8 MPICH and Python

In particular, there is a version of the Python interpreter which is MPI-enhanced (an MPICH2 based version). This is particularly well suited to teaching, since the package provides a Python interpreter with MPI (Message Passing Interface, message-based parallel programming) support and Python is an excellent tool for programming tuition. Dr Verdaguer is already using Python in the classroom for that purpose. Package name: mpich2python (in both architectures' repositories). Description: http://packages.debian.org/wheezy/mpich2python

Chapter 16

Starting up

16 Starting up

16.1 Firing Up the system

First the Pi master node is started up by plugging its mobile phone charger into the socket. Then the compute nodes follow. This is straightforward, although it is important to let the Pi master node finish booting first (give it over 2 minutes, then try ssh'ing into it to see if it is fully alive) so that all compute nodes can mount the NFS share for the MPICH system with no problems (the mount command is in /etc/rc.local on each machine). Then Pi99 is started up. There are three sets of multiple sockets with a switch on each socket. We start by sequentially switching on every socket, then sequentially switching on every computer. It is important to give the system some time to boot. Firstly, some of the computers unavoidably go through a series of POSTs (not all of them can be turned off in the machines' BIOS) and that already takes some time. Additionally, when the computers realise they have no keyboard, mouse or monitor present, they beep angrily and show a message on their nonexistent monitors, which makes the booting process even slower.
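The 'fully alive' check can be scripted. Below is a minimal sketch of what the /hbc/hbc_hosts_alive.sh helper mentioned in 13.4 might contain; the hostname list and the NODES override are illustrative assumptions made here, and the real script is listed in C.3:

```shell
#!/bin/sh
# Sketch of /hbc/hbc_hosts_alive.sh -- ping every node once and report.
NODES="${NODES:-raspberrypi raspberrypi99 mrcarson mrshughes mary edith sybil daisy anna cora}"

hbc_hosts_alive() {
    for node in $NODES; do
        if ping -c1 -W1 "$node" >/dev/null 2>&1; then
            echo "$node alive"
        else
            echo "$node DOWN"
        fi
    done
}
# Invoke as: hbc_hosts_alive
```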

16.2 Starting HBC

The system has been configured so that port 80 (a port that is available at virtually any location with an Internet connection) is used to ssh into the Pi. The Pi master node has been set up accordingly and the residential router forwards the port to the Pi as well. This means that if we had a Web server it would also have to reside on this particular Pi. In our case we are accessing the system in two different ways:
– using Mary as a terminal, ie accessing the system locally from its own subnet;
– using Pi as our gateway, ie connecting via the Internet with our terminal (keyboard and screen) located somewhere else. We are also able to connect using an Internet enabled mobile as a terminal.
To ssh into the main RaspberryPi the following command is issued:
ssh -XC root@raspberrypi
Once there, the system can be easily and quickly accessed by typing:
hbc_start

which is a small script that changes to the /hbc directory and su's into hbcuser (see C.3). Once there, machinefile is edited if needed:
nano machinefile
un-commenting each machine name we want to use, and finally the programs are executed in their MPI environment:
mpiexec -f machinefile -n 15 ./icpi-info.sh

Chapter 17

Environmental Considerations and Energy Costs

17 Power, Efficiency and Environmental Considerations

17.1 Cooling

The computers must be positioned so that air flows freely, sucked in at the front of the computers and expelled at the back. It is also important that the outlets are not blocked by walls or other equipment standing too close.

17.1.1 A Pi Quibble and a Bent Card

Raspberry Pi's can get quite hot in summer when standing on a warm piece of hardware. In our case, our Pi was living on top of the ADSL residential router, both kept turned on at all times, and all of a sudden, when outside temperatures started reaching 30ºC, the Pi stopped functioning. We thought the SD card had died, as the Pi started up again off a new SD card onto which a backup image had been copied. However, upon closer examination of the older card, it turned out it was badly bent, so it had stopped making contact with the SD card socket on the Pi, which was intact. We slightly cracked the plastic cover on the card so as to 'un-bend' it and reinserted it. It then carried on working for over a month. However, for more worldly purposes, ie having a more reliable system, we retired the card and started using a newer one, branded by Toshiba and made in Japan, whose plastic cover looks and feels much sturdier.

17.2 Room Temperature Running all the nodes on the cluster at the same time performing a particular task will use a total of 830 Watt. This is obtained from the power consumption chart B.1.

Supposing: – the system is kept running 24 hours per day,

– room temperature is around 20ºC,

– room is around 4 metres x 3 metres x 3 metres,
a huge calculated temperature increase is obtained, because the initial formulae assume there is no heat loss through windows and walls and ignore seasonal circumstances, whether the room faces north or south, and so on. If we had to work out the actual heat loss, and thus the actual increase in temperature inside the room, the wall surface and the heat transfer ratio of concrete and brick would have to be taken into account; the latter is around 1:10 when pitted against the heat transfer ratio associated with windows. Adding all those together, the amount of heat lost in this way could be calculated. Since the goal of this project is not a study in thermodynamics, an empirical method will be used instead, namely the observation of an increase in room temperature in the region of 2-5ºC depending on the season. However, it is essential that students understand that our system needs a certain amount of energy in order to run, which leaves an environmental footprint behind, and that most of the energy used up by our cluster is converted into heat.

17.3 Performance per Watt

We must calculate what a reference computer is capable of and obtain some sort of Gflops to Watt ratio figure. It makes most sense to pitch x86 and ARM computers against each other by using Gflop benchmark figures, which is why the Linpack benchmarks are used for those calculations. In computing, performance per watt is a measure of the energy efficiency of a particular computer architecture or piece of computer hardware. Literally, it measures the rate of computation that can be delivered by a computer for every watt of power consumed. According to WikipediaE.38:

The performance and power consumption metrics used depend on the definition; reasonable measures of performance are FLOPS, MIPS, or the score for any performance benchmark. (...) The power measurement is often the average power used while running the benchmark, but other measures of power usage may be employed (e.g. peak power, idle power). (...) The UNIVAC I computer performed approximately 0.015 operations per watt-second (performing 1,905 operations per second (OPS), while consuming 125 kW). The Fujitsu FR-V VLIW/vector processor system on a chip, in the 4 FR550 core variant released in 2005, performs 51 Giga-OPS with 3 watts of power consumption, resulting in 17 billion operations per watt-second. This is an improvement by over a trillion times in 54 years. Most of the power a computer uses is converted into heat, so a system that takes fewer watts to do a job will require less cooling to maintain a given operating temperature. (...) Lower energy consumption can also make it less costly to run, and reduce the environmental impact of powering the computer (...)

Performance (in operations/second) per watt can also be written as operations/watt-second, or operations/joule, since 1 watt = 1 joule/second. (...) FLOPS (Floating Point Operations Per Second) per watt is a common measure. Like the FLOPS it is based on, the metric is usually applied to scientific computing and simulations involving many floating point calculations. Examples: (...) BlueGene/Q, Power BQC 16C, is listed as the most efficient supercomputer on the TOP500 in terms of FLOPS per watt, running at 2,100.88 MFLOPS/watt.

Illustration 18: An Intertek power meter measuring MrCarson's power consumption.

More relevantly, HBC contains a particular machine of a kind widely used in beowulf applications: MrsHughes. Microwulf, a low cost desktop Beowulf cluster of 4 dual core Athlon 64 x2 3800+ computers, runs at 58 MFLOPS/watt. This figure is very close (see B.3) to what we have been obtaining when MPI has been distributing tasks to:
– just two RaspberryPi's, at 39 MFLOPS/watt,

– just MrsHughes on her own, at 72 MFLOPS/watt. More on this below.

17.4 Mflops per Watt on HBC

For all results obtained (76 test runs in total) see chart B.2. The most interesting (peak) values are:
– just one Pi running (no MPICH needed in this case): 59.3 Mflops per Watt;
– just MrsHughes running (MPICH sending two processes to her) with 3dnow extensions enabled in the xhpl executable (calling libatlas-3dnow): 72.2 Mflops per Watt.
And the worst values are:
– the whole system running (except for Cora): a measly 0.77 Mflops per Watt. Too heterogeneous;
– the whole system running (except for Cora) without the sse or 3dnow atlas libraries enabled in the xhpl executables: an even lower 0.42 Mflops per Watt.
Those results prove that the more heterogeneous the system is, the worse it performs. We can make the system more or less heterogeneous depending on the nodes we choose to run HPL on. For the more relevant results see the charts in appendix B.
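These ratios are simply the HPL result divided by the wall power measured with the power meter. A sketch of the arithmetic, with both input figures chosen purely for illustration so that the result matches the 72.2 Mflops per Watt quoted above (neither is a measured value from chart B.2):

```shell
# Mflops-per-watt from a benchmark figure and a measured wall-power draw.
# 4550 Mflops and 63 W are illustrative assumptions, not chart B.2 data.
mflops=4550
watts=63
awk -v m="$mflops" -v w="$watts" 'BEGIN { printf "%.1f Mflops/W\n", m / w }'
# prints: 72.2 Mflops/W
```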

17.5 Performance on HBC

– Peak performance is achieved by MrsHughes running on her own with 3dnow enabled, at 4550 Mflops.
– Performance of the three Kayaks with sse enabled (a nearly homogeneous set) is 1131 Mflops, whereas the performance of a single Kayak is roughly over 700 Mflops.
– Lowest performance is a single Kayak with no sse optimisation, at around 65 Mflops.

– Reasonable performance is a single RaspberryPi at around 180 Mflops on its own.

Chapter 18

Costs and Considerations

18 Costs and Considerations

18.1 Man-hours

The purpose of this exercise is to show students that work must be charged for. Assuming a man-hour costs €10, the time we have spent on the project can be roughly summarised as follows: 6 hours per day during the summer holidays (3 weeks); 6 hours per day during the winter holidays (1 week); and 6 hours per day at weekends, equivalent to 2 hours per day across the week, for 10 months.

Period            # days   # hours/day   Hours per period
Winter holidays        7        6               42
Summer holidays       20        6              120
Sept – Dec            15        6               90
Jan – Jul             28        6              168

Total hours       420
Man-hour          €10.00
Total             €4,200.00

18.2 Electricity

Assuming an electricity price of €0.166/kWh in Spain (the bare energy term is quoted at €0.063533/kWh; the calculation below uses the all-in price) and a power consumption of 830 W at full load:

830 W x 1 kW/1000 W x 24 h x 0.166 €/kWh = 3.307 €/day

Using the Beowulf non-stop for a full month would thus add a charge of roughly €100 on top of any other monthly charges.
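The arithmetic above can be scripted so the charge can be re-estimated for other tariffs or consumption figures; a small sketch using the values from this section:

```shell
#!/bin/bash
# Daily and monthly electricity cost of the cluster at full load.
watts=830          # measured full-load consumption
price=0.166        # assumed all-in price in EUR per kWh

daily=$(awk -v w="$watts" -v p="$price" 'BEGIN { printf "%.3f", w / 1000 * 24 * p }')
monthly=$(awk -v d="$daily" 'BEGIN { printf "%.2f", d * 30 }')

echo "Daily cost:   $daily EUR"    # 3.307 EUR
echo "Monthly cost: $monthly EUR"  # 99.21 EUR, i.e. roughly 100 EUR
```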

Chapter 19

Smartphones

19 Smartphones for Extended Eclecticism

This project can also be used to teach students the many things that can be done with high- and low-end smartphones.

19.1 Android phones

Android handsets in particular all run what is basically a modded version of Linux, sharing its kernel, and once we 'root' them we can use them in many ways as though they were Raspberry Pi's. It will 'suffice' to find and compile MPICH2 for the Android phone. This should be done with a cross compiler rather than on the phone itself. There are many tutorials on the Net, and it is of paramount importance to find and follow the one best suited to our needs, that is, to our particular smartphone (see 19.2). It has been shown that it is possible to cross-compile mpich2-1.3.2 for ARMv5 platforms (almost all Android phones). More information can be found at E.42.
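A hedged sketch of what such a cross-compilation might look like, assuming an arm-linux-gnueabi toolchain is installed on the build machine; the toolchain prefix, install path and configure flags are illustrative assumptions, not a tested recipe for any particular handset:

```shell
# Cross-compile MPICH2 1.3.2 for an ARMv5 Android handset (sketch only).
# Toolchain prefix and install path are assumptions and will vary per phone.
tar xzf mpich2-1.3.2.tar.gz
cd mpich2-1.3.2
./configure --host=arm-linux-gnueabi \
            --prefix=/data/local/mpich2 \
            --disable-f77 --disable-fc \
            CC=arm-linux-gnueabi-gcc \
            CXX=arm-linux-gnueabi-g++
make && make install
# The resulting install tree is then pushed to the rooted phone, e.g.:
#   adb push /data/local/mpich2 /data/local/mpich2
```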

19.2 Linux on a Phone vs the Pi

Ideally a fully fledged version of Linux would be installed on the handset. This can be done on some models, although it means not only researching a distro that suits the handset's hardware and specification but also losing the phone's normal functionality. Furthermore, the phone might become 'bricked' in the process, rendering it useless. The reason is that the boot binaries live in flash memory inside the phone and cannot be accessed from the outside. If they are corrupted for any reason, the phone will not boot again until new boot binaries are 'pushed' into the internal flash memory by soldering wires onto the motherboard, a time-consuming and potentially fatal procedure that should be avoided. That is why the Pi is so safe: it always looks for its boot binaries on the SD card, so there is no way we can brick it.

Chapter 20

Conclusions

20 Conclusions

Putting together a heterogeneous cluster built from old to very old computers and controlled by a Raspberry Pi can give students tools to learn and become interested in science, technology, engineering and mathematics, the building blocks of computer science.

20.1 Low Cost

It has been shown throughout this project how a Beowulf built with (i) x86 based computers that were about to be scrapped and (ii) a couple of new and very inexpensive RaspberryPi's can perform quite a few interesting tasks.

20.2 Raw Computing Power

Do RaspberryPi's offer raw computing power? No, not really. There are, however, many advantages to using RaspberryPi's as opposed to PC's:

– they generate very little heat

– they take up very little space

– they can be played around with easily (just swapping SD cards does the trick)

– each Pi can be given a partial task to perform in the cluster,

– last but not least, each student can purchase a Pi for less than €35 and thus become interested in computer science, be it software or hardware oriented, or both.

20.3 The Need for Physical Nodes

Elaborating on this, each Pi can be given a particular, physical task to perform, such as collecting vital on-field information to be fed to the cluster:

• E.g. infrared information that is then processed by the system, or wireless information collected by a receiver coupled to the Pi.

• E.g. one (or several) GPS receivers per Pi, speaking to it via Bluetooth (this would require a Bluetooth module plus its drivers and setup), so that each Pi is located in a different spot. The Pi's could then be networked over WiFi (again, this would need a WiFi dongle) so they could be moved around more easily.

• When it comes to powering the Pi's, a charged portable battery of about 3000 mAh could be used for each Pi so that it would last the whole length of the run.

20.4 Setting up a Particular Cluster Architecture

When setting up a computer cluster with a particular goal in mind, it is best to choose the hardware, if at all possible. Adding and removing old x86 based PC's, and likewise RaspberryPi's, until the desired architecture is achieved can prove vital to fine tuning a Beowulf project.

20.5 Drawbacks

Fine tuning a heterogeneous cluster like this one is challenging and exciting at the same time, albeit time consuming. Also, since one of the goals was maximising results from our hardware, it is a real hindrance not to have been able to use the Raspberry Pi's GPU (a capable processor in its own right) to dramatically increase throughput. Each Pi's processing power could be enhanced roughly one hundred fold, i.e. a couple of orders of magnitude.

20.6 Tinker Around with Thy Cluster

Great fun and extremely educational: tinkering teaches the cluster user / optimiser to learn and improve in the fields of:

– raw programming

– architecture-based library and program optimisation

– cluster configuration, to get the most out of it as opposed to the least, which could happen if it is wrongly configured

– parallel programming, hyper-matrix definition and fine tuning

– new heuristic ideas from the real world to make the system perform better

– having fun.

20.7 Showing the World

We have a bunch of old computers that can actually do something!

20.8 The Future

We hope this dissertation gives an interesting outline of how a Beowulf cluster can be built for practical and educational purposes. In the interest of low-cost, efficient and educational computing, we also hope the Raspberry Pi community will ensure a GPU API is made available so that the most can be got out of the GPU.

When it comes to supercomputer clusters, as already outlined at the Barcelona Supercomputing Centre with the ARM based Mont-Blanc supercomputer, the future lies in using multi-core ARM CPUs closely coupled to their (likely multi-core) GPUs. It should be noted, though, that in order to build an extremely powerful supercomputer it is more straightforward to start from either PowerPC CPUs, with their fantastic processing power, or AMD chips (in our view better value than their Intel counterparts). It goes without saying that supercomputer clusters around the world run GNU software, most of them GNU/Linux, it being a remarkably stable, secure and highly configurable operating system, nowadays every bit as powerful as the old UNIX systems, if not more so, thanks to the international community's efforts and contributions.

Appendix A

List of Hardware, Costs and Features

A List of Hardware

Integrated chart of computers and accessories, with hardware features and costs.

Computers:

Name           Brand / Model           Qty  Cost    Subtotal  Source    RAM          CPU               CPU MHz  Memory MHz
Mary           HP Kayak XM600           1   €0.00   €0.00     Donated   0MB RDRAM    Coppermine x1     600MHz   133MHz
MrCarson       Dell PowerEdge 1500sc    1   €0.00   €0.00     Donated   256MB        Coppermine x1     1.1GHz   133MHz
RaspberryPi    Rpi Found. Model B       1   €28.00  €28.00    Farnell   512MB SDRAM  ARM v6l x1        700MHz   400MHz
RaspberryPi99  Rpi Found. Model B       1   €28.00  €28.00    Farnell   512MB SDRAM  ARM v6l x1        700MHz   400MHz
Sybil          HP Kayak XM600           1   €0.00   €0.00     Donated   128MB        Coppermine x1     600MHz   100MHz
Edith          HP Kayak XM600           1   €0.00   €0.00     Donated   382MB        Katmai x1         600MHz   100MHz
Daisy          Unbranded (N/A)          1   €0.00   €0.00     Donated   512MB        AMD x1            1GHz     -
MrsHughes      HP Pavilion              1   €0.00   €0.00     Donated   1GB          AMD Turion64 x2   1.6GHz   800MHz FSB
Anna           Webgine 1115XL           1   €0.00   €0.00     Donated   256MB        Coppermine x1     1GHz     133MHz
Cora           HP Vectra VL             1   €0.00   €0.00     Donated   96MB         Klamath (PII) x1  300MHz   66MHz

Accessories and spares:

Item                                       Brand             Qty  Cost    Subtotal  Source
IDE HDD for Daisy (Seagate ST)             Seagate            1   €0.00   €0.00     Donated
IDE HDD for Edith (Seagate ST)             Seagate            1   €0.00   €0.00     Donated
128MB SIMM RAM for Sybil                   HP                 2   €0.00   €0.00     Donated
Pentium III Katmai 500MHz (N/A)            Intel              2   €0.00   €0.00     Donated
96MB SIMM RAM for Cora                     HP                 1   €0.00   €0.00     Donated
16GB USB stick                             SanDisk            1   €14.00  €14.00    Mediamarkt
Pi SD memory card, 8GB                     Integral/Toshiba   1   €6.70   €6.70     Amazon UK
Pi99 SD memory card, 8GB                   Integral/Toshiba   1   €6.70   €6.70     Amazon UK
Extra SD memory card, 8GB                  Integral/Toshiba   1   €6.70   €6.70     Amazon UK
Ethernet cable 5m (Mary)                   -                  3   €2.78   €8.34     eBay UK
Ethernet cable 5m (MrCarson)               -                  2   €0.00   €0.00     -
Ethernet cable 3m (Edith)                  -                  1   €2.14   €2.14     eBay UK
Ethernet cable 3m (Daisy)                  -                  1   €2.14   €2.14     eBay UK
Ethernet cable 3m (MrsHughes)              -                  1   €2.14   €2.14     eBay UK
Ethernet cable 3m (Anna)                   -                  1   €2.14   €2.14     eBay UK
Ethernet cable 3m (Cora)                   -                  1   €2.14   €2.14     eBay UK
Ethernet cable 3m (switch to switch)       -                  1   €2.14   €2.14     eBay UK
Ethernet cable 5m (switch to ADSL router)  -                  1   €2.78   €2.78     eBay UK
Case for Pi                                -                  1   €5.70   €5.70     Farnell Element14
Case for Pi99                              -                  1   €5.70   €5.70     Farnell Element14
17" monitor                                Dell               1   €0.00   €0.00     Donated
PS/2 keyboard                              HP                 1   €0.00   €0.00     Donated
PS/2 mouse                                 HP                 1   €0.00   €0.00     Donated
Socket power meter                         Intertek           2   €13.00  €26.00    eBay UK
Switched power socket (6 sockets)          Lidl               2   €8.00   €16.00    Lidl
5-port 100M switch                         Pluscom            2   €7.68   €15.36    eBay UK
8-port 100M switch                         Conceptronic       1   €0.00   €0.00     eBay UK
8-port 100M switch (8 sockets)             LB-Link            1   €9.23   €9.23     eBay UK
Pi power supply (wall charger)             Samsung            1   €6.67   €6.67     eBay UK
Pi99 power supply (wall charger)           Samsung            1   €6.67   €6.67     eBay UK
1GB RAM for Dell                           Infineon           2   €12.30  €24.60    eBay UK
CPU for Dell (PIII Coppermine 1.13GHz)     Intel              2   €17.30  €34.60    eBay UK
Heatsink for Dell second CPU               Intel              1   €12.00  €12.00    eBay UK
VRM for Dell second CPU                    Dell               1   €12.00  €12.00    eBay UK
RDRAM terminator module for Mary           HP                 1   €12.00  €12.00    eBay UK
256MB RDRAM for Mary                       Samsung            4   €3.75   €15.00    eBay UK

Subtotal €315.59

Appendix B

Benchmarks. Power Consumption

B Benchmarks. Power Consumption

B.1 Local Benchmarks. Individual Consumption

These figures were compiled using N=4096 and just one theoretical process executing on every machine, distributed across the dual-core machines' CPUs by the OS.

Columns: Name | IP (192.168.0.x) | OS | MPICH version | RAM and type | CPU, # cores x CPU MHz, memory MHz | Mflops on xhpl, N=4096 (local 1 process generic / through MPI full load generic / local 1 process optimised / through MPI full load optimised / controlled by Pi through MPI) | consumption in W (off / idle / working / full load) | BogoMIPS (each CPU).

Mary           204  Ubuntu 10.04.4     1.4.1  512MB RDRAM       Coppermine 2x600MHz, mem 133MHz             435 / 49.5 / 755 / 64 / 79.4   4.5 / 94 / 124 / 124                        1198
MrCarson       205  Ubuntu 10.04.4     1.2.1  2.3GB SDRAM DIMM  Coppermine 2x1.3GHz, mem 133MHz             822 / 148 / 1317 / 243 / -     3.8 / 118 / 152 / 152                       2527
RaspberryPi    101  Debian Wheezy 7.0  1.4.1  512MB SDRAM       ARM v6l 1x700MHz, mem 400MHz                - / 177 / N/A / 175 / -        1 / 2.8 / 3.0 (RAM only) / 3.3 (SD access)  698
RaspberryPi99  99   Debian Wheezy 7.0  1.4.1  512MB SDRAM       ARM v6l 1x700MHz, mem 400MHz                - / 183 / N/A / 178 / -        1 / 2.7 / 2.9 (RAM only) / 3.2 (SD access)  698
Sybil          207  Ubuntu 10.04.4     1.2.1  382MB SDRAM DIMM  Coppermine 2x600MHz, mem 100MHz             393 / 39.2 / 620 / 61.1 / -    4 / 87 / 120 / 120                          1198
Edith          206  Ubuntu 10.04.4     1.4.1  382MB SDRAM DIMM  Katmai 2x500MHz, mem 100MHz                 326 / 36.2 / 538 / 58.5 / -    2.5 / 87.5 / 135 / 135                      1000
Daisy          201  Ubuntu 10.04.4     1.2.1  512MB SDRAM DIMM  AMD Athlon 1x1GHz, mem 266MHz               1142 / 72.4 / N/A / N/A / -    3.2 / 87.5 / 107 / 107                      2018  (note: generic / 3dnow)
MrsHughes      203  Ubuntu 10.04.3     1.2.1  1GB SODIMM DDR2   AMD Turion64 X2 TL-52 2x1.6GHz, mem 667MHz  2480 / 483 / 4235 / 834 / -    2.5 / 32 / 63 / 63                          1607
Anna           202  Ubuntu 10.04.4     -      256MB SODIMM      Coppermine 1x1.1GHz, mem 133MHz             602 / 48 / N/A / N/A / -       5 / 33 / 61 / 61                            2194
Cora           21   Ubuntu 6.06.2      1.4.1  168MB DIMM        Klamath (PII) 1x300MHz, mem 66MHz           N/A / N/A / N/A / N/A / -      2.2 / 33 / 63 / 63                          602

total when total full 29.7 831 off, W load, W

B.2 HBC Benchmarks

Yellow: no architecture optimisation. Cyan: architecture optimised.

Column order for each row below: Tst # | processes per node (Pi 1, Pi99 1, MrCarson 2, Mary 2, Edith 2, Sybil 2, Cora 1, MrsHughes 2, Anna 1, Daisy 1) | # cores | # procs | Mstr Node | N | P | Q | NB | mpi? | Mflops | Time (s) | Watt full load | Mflop/Watt | Homogeneity (0 to 3) | Notes.

1 1 1 1 Pi 4000 1 1 168 1 170 3 56.67 3 ./xhpl.sh
2 1 1 1 Pi 4000 1 1 168 1 178 3 59.33 3 ./xhpl.sh
3 1 1 2 2 Pi 4000 1 2 168 1 233 6 38.83 3 ./xhpl.sh
4 1 1 2 Pi 4000 1 2 168 1 65 ./xhpl.sh no µc upd/3dnw
5 1 1 2 Pi 4000 1 2 168 1 243 ./xhpl.sh no µc upd/3dnw
6 2 2 2 2 8 8 Pi 4000 1 8 168 1 195 ./xhpl.sh no µc upd/3dnw
7 1 1 1 3 6 Pi 4000 1 6 168 1 162 ./xhpl.sh no µc upd/3dnw
8 1 1 2 2 2 2 10 10 Pi 4000 1 10 168 1 266 ./xhpl.sh no µc upd/3dnw
9 1 1 2 4 4 Pi 4000 1 4 168 1 344 156 2.21 1 ./xhpl.sh no µc upd/3dnw
10 1 1 1 1 4 4 Pi 4000 1 4 168 1 195 ./xhpl.sh no µc upd/3dnw
11 2 2 2 6 6 Pi 4000 1 6 168 1 171 379 0.45 2 ./xhpl.sh no µc upd/3dnw
12 1 1 1 Mary 4000 1 1 168 1 79 ./xhpl.sh no µc upd/3dnw
13 1 2 2 5 5 Mary 4000 1 5 168 1 168 ./xhpl.sh no µc upd/3dnw
14 1 1 2 Mary 4000 1 2 168 1 233 ./xhpl.sh no µc upd/3dnw
15 2 2 4 4 Mary 4000 1 4 168 1 150 ./xhpl.sh no µc upd/3dnw no sse
16 1 2 3 3 Mary 4000 1 3 168 1 279 155 1.8 0 ./xhpl.sh no µc upd/3dnw no sse
17 2 2 2 6 6 Pi 10752 2 3 168 1 168 ./xhpl.sh no µc upd/3dnw no sse
18 1 1 2 2 2 2 2 1 1 14 14 Pi 4096 1 14 168 1 281 163 ./xhpl.sh no µc upd/3dnw no sse
19 1 1 2 2 1 1 8 8 Pi 4096 1 8 168 1 293 156.7 ./xhpl.sh no µc upd/3dnw no sse
20 1 1 2 2 2 2 2 1 1 14 14 Pi 6000 1 14 168 1 319 451.14 768 0.42 0 ./xhpl.sh no µc upd/3dnw no sse
21 1 1 2 4 4 Pi 6000 1 4 168 1 476 303 69 6.9 0 ./xhpl.sh no µc upd/3dnw no sse
22 1 1 2 2 6 6 Pi 6000 1 6 168 1 579 248.6 221 2.62 0 ./xhpl.sh no µc upd/3dnw no sse
23 1 1 2 2 1 7 7 Pi 6000 1 7 168 1 433 333 328 1.32 0 ./xhpl.sh no µc upd/3dnw no sse
24 1 1 2 2 Pi 9240 1 2 168 1 304 1728.3 6 50.67 3 ./xhpl no µc upd/3dnw no sse
25 1 1 2 2 Pi 9240 1 2 168 1 263 2005 ./xhpl.sh
26 1 1 1 Pi 4096 1 1 168 1 172 266 ./xhpl
27 1 1 1 Pi 4096 1 1 168 1 171 267 ./xhpl.sh
28 2 2 2 Pi 1024 1 2 168 1 792 0.91 63 12.57 3 ./xhpl.sh no µc upd/3dnw no sse
29 2 1 3 3 Pi 1024 1 3 168 1 121 5.92 ./xhpl.sh no µc upd/3dnw no sse
30 2 1 3 3 Pi 1024 1 3 168 1 249 2.89 170 1.46 0 ./xhpl.sh no µc upd/3dnw no sse
31 2 1 1 4 4 Pi 1024 1 4 168 1 152 4.73 ./xhpl.sh no µc upd/3dnw no sse
32 1 2 1 1 5 5 Pi 1024 1 5 168 1 111 6.46 ./xhpl.sh no µc upd/3dnw no sse
33 1 2 1 1 5 5 Pi 1024 1 5 168 1 125 5.76 ./xhpl.sh no µc upd/3dnw no sse
34 1 1 2 2 2 2 2 1 1 14 14 Pi 1024 1 14 168 1 184 3.9 ./xhpl.sh no µc upd/3dnw no sse
35 1 1 2 2 2 2 2 1 1 14 14 Pi 4096 1 14 168 1 261 175.33 ./xhpl.sh no µc upd/3dnw no sse
36 1 1 2 2 2 2 2 1 1 14 14 Pi 17024 2 7 128 1 388 8484.2 768 0.51 0 ./xhpl.sh µcode upd + 3dnow instld no sse
37 2 2 4 4 Pi 4096 2 2 128 1 480 95.4 ./xhpl.sh µcode upd + 3dnow instld sse
38 2 2 2 1 7 7 Pi 4096 1 7 128 1 173 264.5 ./xhpl.sh µcode upd + 3dnow instld sse
39 1 1 2 2 Pi 4096 1 2 128 1 142 325 ./xhpl.sh µcode upd + 3dnow instld
40 1 1 2 2 Pi 4096 1 2 128 1 223 206 ./xhpl.sh
41 2 2 2 6 6 Pi 4096 1 6 128 1 175 262.6 379 0.46 2 ./xhpl.sh
42 2 1 3 3 Pi 4096 1 3 128 1 220 208.3 ./xhpl.sh
43 2 2 1 - 4096 1 1 128 0 2300 10.3 ./xhpl 3dnow
44 1 1 2 2 Pi 6144 1 2 128 1 235 657 ./xhpl
45 2 1 3 3 Pi 4096 1 3 128 1 1392 33 ./xhpl.sh 3dnow sse
46 1 1 2 2 Pi 8192 1 2 128 1 281 1306.3 ./xhpl.sh
47 1 1 1 - 4096 1 1 128 0 177 241 ./xhpl
48 2 2 1 - 4096 1 1 128 0 441 96.71 ./xhpl sse
49 2 2 2 Pi 4096 2 1 128 1 4235 10.86 63 67.22 3 ./xhpl.sh 3dnow
50 2 2 2 Pi 8192 2 1 128 1 4550 80.59 63 72.22 3 ./xhpl.sh 3dnow
51 2 2 2 Pi 4096 1 2 128 1 4128 11.1 ./xhpl.sh 3dnow
52 2 2 2 Pi 8192 2 1 128 1 3050 121.6 ./xhpl.sh 3dnow
53 2 2 2 Pi 1024 2 1 128 1 2700 ./xhpl.sh 3dnow
54 2 2 2 Pi 9088 1 2 128 1 3338 150 ./xhpl.sh 3dnow
55 2 2 1 - 1024 1 1 128 0 55 124 0.44 ./xhpl.x86 no sse
56 2 2 1 - 1024 1 1 128 0 396 124 3.19 ./xhpl.sse sse
57 2 2 2 Pi 4096 2 1 128 0 755 60.8 124 6.09 ./xhpl.sse sse
58 1 1 1 - 6016 1 1 128 0 1230 118.18 107 11.5 ./xhpl.3dnow 3dnow
59 1 1 2 2 2 2 2 1 1 14 14 Pi 4096 1 14 128 1 590 77.7 768 0.77 0 ./xhpl.sh 3dnow sse
60 1 1 1 Pi 4096 1 1 128 1 1142 ./xhpl.sh 3dnow
61 2 1 3 3 Pi 4096 1 3 128 1 2078 170 12.22 1 ./xhpl.sh 3dnow
62 1 1 2 2 Pi 4096 1 2 128 1 894 168 ./xhpl.sh 3dnow
63 1 1 2 2 Pi 4096 2 1 128 1 1553 29.5 ./xhpl.sh 3dnow
64 1 1 1 Pi 4096 1 1 128 1 2348 19.5 60 39.13 ./xhpl.sh 3dnow
65 1 1 1 Pi 4096 1 1 128 1 431 106.43 120 3.59 ./xhpl.sh sse
66 2 2 2 6 6 Pi 4096 2 3 128 1 1131 40.54 379 2.98 2 ./xhpl.sh sse
67 2 2 2 Pi 4096 2 1 128 1 750 61.13 124 6.05 3 ./xhpl.sh sse
68 2 2 4 4 Pi 4096 2 2 128 1 1023 44.83 ./xhpl.sh sse
69 2 2 1 5 5 Pi 4096 1 5 128 1 1330 34.46 ./xhpl.sh 3dnow sse
70 2 2 2 1 7 7 Pi 4096 1 7 128 1 1285 35.7 3dnow sse
71 2 2 1 1 6 6 Pi 4096 1 6 128 1 1386 33.08 3dnow sse
72 2 2 2 1 1 8 8 Pi 4096 1 8 128 1 1391 32.96 3dnow sse
73 1 1 2 2 Pi 4096 1 2 128 1 746 61.42 231 3.23 1 3dnow sse
74 2 2 4 4 Pi 4096 2 2 128 1 2054 22.31 215 9.55 1 3dnow sse
75 2 2 2 Pi 4096 2 1 128 1 1317 34.81 152 8.66 3 sse
76 2 2 2 Pi 14336 2 1 128 1 1490 34.81 148 10.07 3 sse
77 2 2 2 Pi 4096 2 1 128 1 61.1 750.76 120 0.51 3 ./xhpl.x86 xhpl generic
78 1 1 2 2 2 2 2 1 1 14 14 Pi 12000 1 14 128 1 1770 650.8 770 2.3 0 ./xhpl.sh 3dnow sse

B.3 Relevant HBC Benchmarks

Homogeneous and heterogeneous node combinations, sorted by Mflops. Same column order as B.2 (the Watt column is the consumption at full load).

50 2 2 2 Pi 8192 2 1 128 1 4550 80.59 63 72.22 3 ./xhpl.sh 3dnow
49 2 2 2 Pi 4096 2 1 128 1 4235 10.86 63 67.22 3 ./xhpl.sh 3dnow
64 1 1 1 Pi 4096 1 1 128 1 2348 19.5 60 39.13 ./xhpl.sh 3dnow
61 2 1 3 3 Pi 4096 1 3 128 1 2078 170 12.22 1 ./xhpl.sh 3dnow
74 2 2 4 4 Pi 4096 2 2 128 1 2054 22.31 215 9.55 1 3dnow sse
78 1 1 2 2 2 2 2 1 1 14 14 Pi 12000 1 14 128 1 1770 650.8 770 2.3 0 ./xhpl.sh 3dnow sse
75 2 2 2 Pi 4096 2 1 128 1 1317 34.81 152 8.66 3 sse
58 1 1 1 - 6016 1 1 128 0 1230 118.18 107 11.5 ./xhpl.3dnow 3dnow
66 2 2 2 6 6 Pi 4096 2 3 128 1 1131 40.54 379 2.98 2 ./xhpl.sh sse
28 2 2 2 Pi 1024 1 2 168 1 792 0.91 63 12.57 3 ./xhpl.sh no µc upd/3dnw no sse
57 2 2 2 Pi 4096 2 1 128 0 755 60.8 124 6.09 ./xhpl.sse sse
67 2 2 2 Pi 4096 2 1 128 1 750 61.13 124 6.05 3 ./xhpl.sh sse
73 1 1 2 2 Pi 4096 1 2 128 1 746 61.42 231 3.23 1 3dnow sse
59 1 1 2 2 2 2 2 1 1 14 14 Pi 4096 1 14 128 1 590 77.7 768 0.77 0 ./xhpl.sh 3dnow sse
22 1 1 2 2 6 6 Pi 6000 1 6 168 1 579 248.6 221 2.62 0 ./xhpl.sh no µc upd/3dnw no sse
21 1 1 2 4 4 Pi 6000 1 4 168 1 476 303 69 6.9 0 ./xhpl.sh no µc upd/3dnw no sse
23 1 1 2 2 1 7 7 Pi 6000 1 7 168 1 433 333 328 1.32 0 ./xhpl.sh no µc upd/3dnw no sse
65 1 1 1 Pi 4096 1 1 128 1 431 106.43 120 3.59 ./xhpl.sh sse
56 2 2 1 - 1024 1 1 128 0 396 124 3.19 ./xhpl.sse sse
36 1 1 2 2 2 2 2 1 1 14 14 Pi 17024 2 7 128 1 388 8484.2 768 0.51 0 ./xhpl.sh µcode upd + 3dnow instld no sse
9 1 1 2 4 4 Pi 4000 1 4 168 1 344 156 2.21 1 ./xhpl.sh no µc upd/3dnw
20 1 1 2 2 2 2 2 1 1 14 14 Pi 6000 1 14 168 1 319 451.14 768 0.42 0 ./xhpl.sh no µc upd/3dnw no sse
24 1 1 2 2 Pi 9240 1 2 168 1 304 1728.3 6 50.67 3 ./xhpl no µc upd/3dnw no sse
16 1 2 3 3 Mary 4000 1 3 168 1 279 155 1.8 0 ./xhpl.sh no µc upd/3dnw no sse
30 2 1 3 3 Pi 1024 1 3 168 1 249 2.89 170 1.46 0 ./xhpl.sh no µc upd/3dnw no sse
3 1 1 2 2 Pi 4000 1 2 168 1 233 6 38.83 3 ./xhpl.sh
2 1 1 1 Pi 4000 1 1 168 1 178 3 59.33 3 ./xhpl.sh
41 2 2 2 6 6 Pi 4096 1 6 128 1 175 262.6 379 0.46 2 ./xhpl.sh
11 2 2 2 6 6 Pi 4000 1 6 168 1 171 379 0.45 2 ./xhpl.sh no µc upd/3dnw
1 1 1 1 Pi 4000 1 1 168 1 170 3 56.67 3 ./xhpl.sh
76 2 2 2 Pi 4096 2 1 128 1 61.1 750.76 120 0.51 3 ./xhpl.x86 xhpl generic
55 2 2 1 - 1024 1 1 128 0 55 124 0.44 ./xhpl.x86 no sse

Appendix C

Code

C Code

C.1 Example icpi-info

Taken from the MPICH2 examples included in http://www.mpich.org/static/downloads/1.4.1/mpich2-1.4.1.tar.gz. Its COPYRIGHT file states: "Permission is hereby granted to use, reproduce, prepare derivative works, and to redistribute to others." This software was authored by the Mathematics and Computer Science Division, Argonne National Laboratory, Argonne IL 60439. We modified it so that it prints the number and location of each process on the terminal screen.

/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
 *  (C) 2001 by Argonne National Laboratory.
 *      See COPYRIGHT in top-level directory.
 */

/* This is an interactive version of cpi */
#include "mpi.h"
#include <stdio.h>
#include <math.h>

double f(double);

double f(double a)
{
    return (4.0 / (1.0 + a*a));
}

int main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    /*
    fprintf(stdout, "Process %d of %d is on %s\n",
            myid, numprocs, processor_name);
    fflush(stdout);
    */

    while (!done) {
        if (myid == 0) {
            fprintf(stdout, "Enter the number of intervals: (0 quits) ");
            fflush(stdout);
            if (scanf("%d", &n) != 1) {
                fprintf(stdout, "No number entered; quitting\n");
                n = 0;
            }
            startwtime = MPI_Wtime();
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            done = 1;
        else {
            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += f(x);
            }
            mypi = h * sum;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (myid == 0) {
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
                fflush(stdout);
            }
        }
    }
    MPI_Finalize();
    return 0;
}

C.2 The HPL suite

This has been obtained from http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz and then compiled as explained in this dissertation. The package has been developed at the University of Tennessee, Knoxville, and its COPYRIGHT file states:

-- High Performance Computing Linpack Benchmark (HPL)
   HPL - 2.1 - October 26, 2012
   Antoine P. Petitet
   University of Tennessee, Knoxville
   Innovative Computing Laboratory
   (C) Copyright 2000-2008 All Rights Reserved

-- Copyright notice and Licensing terms:

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. All advertising materials mentioning features or use of this software must display the following acknowledgement: This product includes software developed at the University of Tennessee, Knoxville, Innovative Computing Laboratory.

4. The name of the University, the name of the Laboratory, or the names of its contributors may not be used to endorse or promote products derived from this software without specific written permission.

C.3 Scripts on Master Node

/hbc/hbc_poweroff.sh

#!/bin/bash

echo "Shutting down all nodes excepting Pi99 and Pi..."

for count in {mrcarson,mary,sybil,edith,mrshughes,daisy,anna,cora}
do
    host=$count
    if ping -w 2 -c 1 "$host" > /dev/null
    then
        # echo status=$?
        echo Trying to shut down "$host"...
        ssh "$host" poweroff
    else
        # echo status=$?
        echo "$host" is not reachable.
    fi
    # echo "$count"
    # ping -c1 $count
done

#If we wish the Master Node and the Pi99 compute node to power down
#as well, we add raspberrypi99 in the vector and then we add a line
#for manually powering down Master Node raspberrypi.

/hbc/hbc_reboot.sh

#!/bin/bash

echo "Rebooting all nodes excepting Pi99 and Pi..."

for count in {mrcarson,mary,sybil,edith,mrshughes,daisy,anna,cora}
do
    host=$count
    if ping -w 2 -c 1 "$host" > /dev/null
    then
        # echo status=$?
        echo Trying to reboot "$host"...
        ssh "$host" reboot
    else
        # echo status=$?
        echo "$host" is not reachable.
    fi
    # echo "$count"
    # ping -c1 $count
done

#If we wish the Master Node and the Pi99 compute node to reboot
#as well, we add raspberrypi99 in the vector and then we add a line
#for manually rebooting Master Node raspberrypi.

/hbc/hbc_hosts_alive.sh

100 #!/bin/bash

echo "Finding out which hosts are alive on the network..."
echo " Warning: some may be successfully pinged due to an active network card"
echo " but still be off."

for count in {mrcarson,mary,sybil,edith,mrshughes,daisy,anna,cora,raspberrypi99}
do
    host=$count
    if ping -w 2 -c 1 "$host" > /dev/null
    then
        # echo status=$?
        echo Can ping "$host"
    else
        # echo status=$?
        echo Cannot ping "$host"
    fi
done

/usr/bin/hbc_start

cd /hbc
su hbcuser

C.4 The HPL.dat file

Below is an example for a single compute node. To increase the matrix size we increase N. Care must be taken not to overload the compute node with the least memory installed, as disc swap could make the system stall. P and Q are the process grid parameters, and P x Q = total number of processes on the cluster. E.g. 3 computers with 2 cores each could use P=3 and Q=2. This may be suboptimal if the grid is heterogeneous; the more square the grid is, the better. E.g. for calculations using compute nodes Mary and Edith we would use P=2 and Q=2.

HPLinpack benchmark input file
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
4000         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
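The advice above (P x Q equal to the process count, as square as possible) can be automated. The sketch below is a hypothetical helper, not one of the thesis scripts; it also applies a common HPL rule of thumb, not stated in the text, of choosing N near sqrt(0.8 x memory_in_bytes / 8) so the matrix fills about 80% of the smallest node's RAM without swapping:

```shell
#!/bin/bash
# Suggest HPL.dat values for a given core count and per-node RAM (in MB).
cores=${1:-14}
ram_mb=${2:-256}

# Most nearly square grid: largest divisor of $cores not exceeding sqrt(cores).
best_p=1
for ((p=1; p*p<=cores; p++)); do
    if (( cores % p == 0 )); then
        best_p=$p
    fi
done

# Rule of thumb: N ~ sqrt(0.8 * RAM_bytes / 8), with 8 bytes per double.
n=$(awk -v mb="$ram_mb" 'BEGIN { printf "%d", sqrt(0.8 * mb * 1048576 / 8) }')

echo "Ns $n"
echo "Ps $best_p"
echo "Qs $(( cores / best_p ))"
```

For the full 14-core cluster this suggests P=2 and Q=7, the grid actually used in several of the whole-system runs in chart B.2.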

Appendix D

Dept Of Education Master Plan Chart 2013-2014

103 D Dept of Education Master Plan Chart, 2013 - 14 Dept. D'Ensenyament. Línies de formació 2013-2014. Departament d'Ensenyament Línies aprovades pel Comitè de Formació del Departament d'Ensenyament en la sessió del dia 30 de març de 2012 PEC Eix Prio- Sub- ritat prioritat 1 Suport als projectes educatius dels centres per a la millora dels processos i resultats educatius 1 Millora dels resultats educatius de tot l'alumnat i de la convivència en els centres, mitjançant l'aplicació dels currículums competencials i l'enfortiment dels aprenentatges claus 1a Desenvolupament competencial del currículum i aprenentatges claus, especialment en llengua i matemàtiques - Definició de criteris per analitzar i millorar activitats de lectura i promoure l'ensenyament explícit de les estratègies i habilitats lectores (Impuls de la lectura) en entorns analògics i digitals - Estratègies d'aula, materials i recursos, inclosos els digitals, que potenciïn el treball matemàtic i l'aplicació dels coneixements i les habilitats adquirides - La biblioteca escolar com un espai de suport al currículum 1b Actualització del programa d'immersió lingüística des de la perspectiva del tractament integrat de llengua i continguts 1c Millora de la competència comunicativa de l'alumnat en llengües estrangeres i extensió de la metodologia AICLE 1d Desenvolupament, seguiment i avaluació de la competència digital de l'alumnat com a competència transversal 1e Participació en projectes internacionals en el marc dels programes d'aprenentatge permanent i de les altres accions impulsades per la Unió Europea 1f Aplicació d'estratègies i recursos per millorar la identificació precoç i la intervenció educativa en el cas d'alumnes amb discapacitats, trastorns d'aprenentatge i altes capacitats 1g Enfortiment dels projectes de convivència com a eina de millora del clima escolar i l'èxit educatiu 1h Promoció d'una educació que potenciï la igualtat real d'oportunitats i l'eliminació de tota mena de discriminació per 
raó de sexe, i integració de forma explícita i amb continguts d'aprenentatge de la perspectiva de gènere 2 La funció tutorial i l'orientació com a eina clau per a l'acompanyament de l'alumnat i la millora del clima de centre 2a L'orientació educativa, acadèmica i, en el cas de secundària, professional, com a eina de millora de l'alumnat des de la perspectiva acadèmica, social i de l'orientació professional 2b L'acció tutorial i l'orientació personal com a acció educativa que complementa i enriqueix la tasca docent orientadora 2c L'acció tutorial com a eina per millorar la coresponsabilitació escola-família 2d L'orientació tutorial com a eina per millorar la convivència 2e La formació per a la gestió de les pràctiques en centres de treball de l'alumnat de formació professional com a eina d'inserció laboral 2f El foment de l'autoformació integrada, entesa com una metodologia que fomenta l'autoregulació de l'alumnat completant el treball a l'aula amb el treball autònom orientat, en els ensenyaments d'adults 3 Participació de la comunitat educativa i relacions amb l'entorn 3a Desenvolupament dels plans educatius d'entorn i altres projectes educatius comunitaris 3b Promoció del treball i l'aprenentatge en xarxa com una eina per elaborar projectes comunitaris i construir una acció contínua i coherent entre els diferents agents

104 educatius d'un territori 3c Desenvolupament del servei comunitari (voluntariat) i les accions de compromís cívic que faciliten l'aprenentatge en l'exercici actiu de la ciutadania 3d Enfortiment de les comunitats d'aprenentatge com a projecte per aconseguir l'èxit educatiu per a tothom i la millora de la convivència 3e Potenciació de recursos i estratègies per a la dinamització dels espais de comunicació i de participació de la comunitat educativa 4 Recursos i estratègies per millorar la capacitat i la competència dels equips docents per dur a terme una adequada atenció a la diversitat de l'alumnat amb necessitats educatives específiques 4a Avaluació psicopedagògica per a la identificació de necessitats educatives i elaboració de propostes d¿intervenció 4b Intervenció educativa per afavorir l'aprenentatge de l'alumnat amb trastorns de conducta, trastorns d'aprenentatge i discapacitats 4c Inclusió digital: accessibilitat dels recursos educatius, dels entorns de treball digitals i dels suports digitals 4d Incorporació al nostre sistema educatiu de l'alumnat d'incorporació tardana, tant des del punt de vista lingüístic com emocional: aules d'acollida i itinerari d'incorporació a l'aula ordinària 4e Educació intercultural com a resposta pedagògica per preparar la ciutadania perquè pugui desenvolupar-se en una societat plural i democràtica 5 Impuls per a la direcció i el lideratge dels centres públics per a una gestió i organització afavoridores de l'èxit educatiu de l'alumnat 5a Formació inicial, permanent d'actualització i professionalitzadora per a la direcció i el lideratge dels centres públics 5b Formació en qualitat i millora contínua adreçada als centres de formació professional i de règim especial per a la millora de la gestió en el camí de l'excel·lència 6 Coordinació de centres com a eina afavoridora d'un model de treball i d'aprenentatge en xarxa 6a Propostes de formació orientades a la coordinació i l'intercanvi entre persones responsables en algun àmbit 
de gestió dels centres educatius 6b Enfortiment de les xarxes de treball intercentres que afavoreixin l'intercanvi d'experiències i la transferència entre centres i al sistema 6c Grups de treball interprofessionals en el marc dels plans educatius d'entorn

PROFESSORAT Eix Prio- Sub- ritat prioritat 2 Millora de les competències professionals del professorat per a la seva pràctica docent i de l'altre personal adscrit als centres educatius 7 Suport al professorat novell amb la finalitat d'acollir-lo, acompanyar-lo i vetllar pel seu creixement professional 8 Capacitació i actualització metodològica, científica, tècnica i didàctica 8a Formació i actualització lingüística del professorat 8b Formació i actualització del professorat per afavorir, almenys, el domini d'una llengua estrangera 8c Competència comunicativa, metodològica i didàctica, en l'àmbit del plurilingüisme, de les llengües estrangeres curriculars 8d Formació específica del professorat de formació professional, programes de qualificació professional inicial (PQPI), ensenyaments de règim especial i formació de persones adultes 8e Programa d'actualització científica, tècnica i didàctica 8f Salut docent i prevenció de riscos laborals 8g Competència digital del professorat com a transversal per a la pràctica docent en les seves dimensions tecnològica, comunicativa, metodològica, axiològica i per al desenvolupament professional

TRAINERS (FORMADORS) — Axis / Priority / Sub-priority
3 Improvement of the professional competences of trainers and of the educational support services
9 Training of trainers, consistent with the training priorities
9a Training as a tool for change, linked to pupils' academic success
9b Design, organisation and evaluation of training (transfer and impact)
9c Strengthening trainer teams' digital teaching competence as a transversal competence
10 Training aimed at the educational support services
10a Improving the competence of the professionals of the educational support services:
- regarding the psycho-pedagogical assessment of pupils, and the advice given to schools and teachers to advance attention to diversity in the classroom, with preference for pupils with learning disorders, behavioural disorders, high abilities or specific educational needs;
- regarding the strengthening of the profile of training managers, and their capacity to advance in evaluating the transfer and impact of training in schools and classrooms;
- regarding the integrated treatment of language and content (immersion at infant, primary and secondary level) and work on reading comprehension.
http://www.xtec.cat/web/formacio/linies_formacio

Appendix E

References

E References
E.1 Ian Livingstone, article on empowering through learning to code, V3 magazine. http://2012tech.v3.co.uk/live/Ian-Livingstone.htm. Accessed March 2013.
E.2 Korean schools welcome robot teachers, CNET, 28 December 2010. http://news.cnet.com/8301-17938_105-20026714-1.html. Accessed March 2013.
E.3 J. Verdaguer, Un laptop de 300€ per l'ensenyament [A €300 laptop for education], Jornades de Programari Lliure, ETSEIB-UPC, Barcelona, 5-8 July 2006.
E.4 Magnet Programme for succeeding in education at Fundació Bofill. http://www.fbofill.cat/?codmenu=01&not=962
E.5 John Naughton, Why all our kids should be taught how to code, The Guardian, Saturday 31 March 2012. http://www.guardian.co.uk/education/2012/mar/31/why-kids-should-be-taught-code. Accessed December 2012.
E.6 Everis, Consulting, IT & Outsourcing Professional Services, corporate website. http://www.everis.com/global/en-US/home/Paginas/home.aspx. Accessed December 2012.
E.7 The Everis poll, Everis website. http://www.everis.com/catalonia/WCLibraryRepository/Factors%20influents%20eleccio%20estudis%20CTM.PDF and http://www.everis.com/catalonia/ca-ES/sala-de-premsa/noticies/Paginas/estudi-ctm-everis.aspx. Accessed December 2012.
E.8 The term 'college' is used here in the European (UK and EIRE) meaning.
E.9 Stem, Wiktionary. http://en.wiktionary.org/wiki/stem. Accessed March 2013.
E.10 National Science Foundation, Fabricating the Future: 3-D Printing Molds, New K-12 STEM Model. http://www.nsf.gov/mobile/discoveries/disc_summ.jsp?cntn_id=127769&org=NSF. Accessed March 2013.
E.11 Eben Upton, founder, Raspberry Pi. http://www.jbs.cam.ac.uk/media/2012/raspberry-pi/. Accessed March 2013.
E.12 Robert G. Brown, Beowulf Resources, Duke Physics, Durham, NC, USA. http://www.phy.duke.edu/~rgb/Beowulf/beowulf.php. Accessed August 2013.
E.13 3DNow!, Wikipedia. http://en.wikipedia.org/wiki/3DNow!. Accessed August 2013.
E.14 Streaming SIMD Extensions, Wikipedia. http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions. Accessed September 2012.
E.15 ARM Thumb Instruction Set, ARM Information Center. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473c/CEGBEIJB.html. Accessed September 2012.
E.16 Occam Programming Language, Wikipedia. http://en.wikipedia.org/wiki/Occam_programming_language. Accessed September 2012.
E.17 HP Support, HP Kayak XM600 Workstation Specs and Installation Guide, HP website. http://bizsupport2.austin.hp.com/bc/docs/support/SupportManual/lpv06816/lpv06816.pdf. Accessed September 2012.
E.18 Raspberry Pi Diagram, ComputerLearnHow.com. http://computerlearnhow.com/wp-content/uploads/2012/05/Pi-diagram.jpg. Accessed July 2013.
E.19 RPi Software, Elinux. http://elinux.org/RPi_Software. Accessed July 2013.
E.20 RPi Software Architecture, Elinux. http://elinux.org/images/a/a2/VideoCore-Architecture-and-Source.png. Accessed July 2013.
E.21 Raspberry Pi VideoCore, Elinux. http://elinux.org/RPi_VideoCore_APIs. Accessed July 2013.
E.22 Roger Kay, The PC Industry Is Digging Its Own Grave, Forbes online. http://www.forbes.com/sites/rogerkay/2013/04/12/the-pc-industry-is-digging-its-own-grave/. Accessed March 2013.
E.23 Ubuntu now fits your phone; introducing the superphone that's also a full PC, Ubuntu website. http://www.ubuntu.com/phone. Accessed August 2013.
E.24 The BeagleBoard BeagleBone cards, BeagleBoard.org Foundation (a non-profit US-based corporation). http://www.beagleboard.org/products/beaglebone. Accessed August 2013.
E.25 GPU Processing API, Raspberry Pi Forum. http://www.raspberrypi.org/phpBB3/viewtopic.php?f=33&t=6188. Accessed August 2013.
E.26 Jacek Radajewski and Douglas Eadline, Beowulf HOWTO, v1.1.1, MIT, 22 November 1998. http://stuff.mit.edu/afs/athena/system/rhlinux/redhat-6.2-docs/HOWTOS/Beowulf-HOWTO. Accessed December 2012.
E.27 Samsung's Exynos 5 Octa Specifications, Ubergizmo.com. http://www.ubergizmo.com/2013/01/samsung-exynos-5-octa-specs. Accessed August 2013.
E.28 Linus Torvalds, historic comp.os.minix post, 25 August 1991. http://groups.google.com/group/comp.os.minix/msg/b813d52cbc5a044b?dmode=source&pli=1. Accessed December 2012.
E.29 Industrial Light and Magic, Wikipedia. http://en.wikipedia.org/wiki/Industrial_Light_%26_Magic. Star Wars, Wikipedia. http://en.wikipedia.org/wiki/Star_Wars. Accessed December 2012.
E.30 Young Sherlock Holmes, Wikipedia. http://en.wikipedia.org/wiki/Young_Sherlock_Holmes. Accessed December 2012.
E.31 Jurassic Park (film), Wikipedia. http://en.wikipedia.org/wiki/Jurassic_Park_(film). Accessed December 2012.
E.32 Canonical's Ubuntu 10.04 LTS Server Edition features the ideal deployment platform for Linux server workloads and cloud computing, Canonical press release. http://insights.ubuntu.com/news/press-releases/canonicals-ubuntu-10-04-lts-server-edition-features-the-ideal-deployment-platform-for-linux-server-workloads-and-cloud-computing/. Accessed August 2013.
E.33 The Linpack project, Netlib.org. http://www.netlib.org/linpack/. Accessed July 2013.
E.34 Million Instructions Per Second, Wikipedia. http://en.wikipedia.org/wiki/Million_instructions_per_second#Million_instructions_per_second. Accessed December 2012.
E.35 Linpack, a collection of Fortran subroutines to benchmark systems, Netlib. http://www.netlib.org/linpack/. Accessed August 2013.
E.36 Download of the older, non-parallelised Linpack benchmark tool. http://www.netlib.org/benchmark/linpackc.new. Accessed August 2013.
E.37 3DNow!, Wikipedia. http://en.wikipedia.org/wiki/3DNow!. Accessed August 2013.
E.38 Performance per Watt, Wikipedia. http://en.wikipedia.org/wiki/Performance_per_watt. Accessed July 2013.
E.39 SSE, Streaming SIMD Extensions, Wikipedia. http://en.wikipedia.org/wiki/SSE. Accessed December 2012.
E.40 Raspbian Archive, mpich2. ftp://ftp.mirrorservice.org/sites/archive.raspbian.org/raspbian/pool/main/m/mpich2/mpich2_1.4.1-1_armhf.deb. Accessed August 2013.
E.41 Beowulf Benchmarking, Summer 2004 Beowulf benchmarking project. http://10.pins1.xdsl.nauticom.net/bvds/computing/Beowulf_Benchmark.htm. Accessed August 2013.
E.42 Compiling MPICH2 for Android and Running on Two Phones, And Thus Goes By Another Day. http://hex.ro/wp/projects/personal-cloud-computing/compiling-mpich2-for-android-and-running-on-two-phones/. Accessed August 2013.
E.43 No copyright infringement intended. All names belong to their respective owners. Copyright disclaimer under Section 107 of the Copyright Act 1976: allowance is made for "fair use" for purposes such as criticism, comment, news reporting, teaching, scholarship, and research. Fair use is a use permitted by copyright statute that might otherwise be infringing. Non-profit, educational or personal use tips the balance in favour of fair use.
E.44 Sivarama P. Dandamudi, Guide to RISC Processors for Programmers and Engineers, Springer, 2005.
E.45 Hans-Peter Messmer, The Indispensable Pentium Book, Addison-Wesley, 1995.
E.46 A. Carling, Parallel Processing: The Transputer and Occam, Sigma Press, 1988.
E.47 Carla Schroder, Linux Cookbook, O'Reilly, 2005.
