High Performance Computing on an IBM Cell Processor

Bioinformatics May08-24 Final Report

Client: Iowa State University Department of Electrical and Computer Engineering

Faculty Advisor: Dr. Zhao Zhang

Team Members: Kyle Byerly, Matt Rohlf, Shannon McCormick, Bryan Venteicher

Submitted: May 5, 2008

DISCLAIMER: This document was developed as a part of the requirements of an electrical and computer engineering course at Iowa State University, Ames, Iowa. This document does not constitute a professional engineering design or a professional land surveying document. Although the information is intended to be accurate, the associated students, faculty, and Iowa State University make no claims, promises, or guarantees about the accuracy, completeness, quality, or adequacy of the information. The user of this document shall ensure that any such use does not violate any laws with regard to professional licensing and certification requirements. This use includes any work resulting from this student-prepared document that is required to be under the responsible charge of a licensed engineer or surveyor. This document is copyrighted by the students who produced this document and the associated faculty advisors. No part may be reproduced without the written permission of the senior design course coordinator

Table of Contents
1. Requirement Specifications
  1.1 Problem/Need Statement
  1.2 Proposed Solution
  1.3 Concept Sketch
  1.4 System Description
  1.5 Operating Environment
  1.6 User Interface Description
  1.7 Functional Requirements
  1.8 Non-functional Requirements
  1.9 Market/Literature Survey
  1.10 Deliverables
2 Project Plan
  2.1 Work Breakdown Structure
  2.2 Resources
    2.2.1 Organizational Chart
    2.2.2 Personnel
    2.2.3 Materials
    2.2.4 Financial
  2.3 Project Schedule
    2.3.1 Project Gantt Chart
    2.3.2 Deliverable Schedule Gantt Chart
3.0 Project Design
  3.1 Design Method and Engineering Specifications
    3.1.1 Design Method
    3.1.2 Input Specification
    3.1.3 Output Specification
    3.1.4 User Interface Specification
    3.1.5 Hardware Specification
    3.1.6 Software Specification
    3.1.7 Test Specification
4.0 Implementation
  4.1 Ported Program Selection
  4.2 Explanation of DNAPenny
  4.3 Previous Parallelization of DNAPenny
  4.4 Cell/B.E. Parallelization Models
  4.5 Libspe2
  4.6 ClustalW Cell/B.E. Prototype
5 Testing
7.0 Other stuff
8.0 References

List of Figures
Figure 1: System Block Diagram
Figure 2: Work Breakdown Structure
Figure 3: Organization Chart
Figure 4: Project Schedule Gantt Chart
Figure 5: Project Deliverables Gantt Chart
Figure 6 – infile.orig
Figure 7 - in_b17x1237.txt
Figure 8 - in_b12x1236.txt
Figure 9 - in_b10x1219.txt
Figure 10 - in_a18x822.txt
Figure 11 - in_a15x822.txt

List of Tables
Table 1: Personnel Resources
Table 2: Hours Fall 2007 semester, excluding first three weeks
Table 3: Materials Resources
Table 4: Financial Resources
Table 5 - infile.orig
Table 6 - in_b17x1237.txt
Table 7 - in_b12x1236.txt
Table 8 - in_b10x1219.txt
Table 9 - in_a18x822.txt
Table 10 - in_a15x822.txt
Table 11: foo

List of Definitions

BioPerf: Suite of bioinformatics applications packaged as a benchmark

Cell/B.E.: Cell Broadband Engine; advanced microprocessor architecture designed by Sony, Toshiba, and IBM

EIB: Element Interconnect Bus; a high speed bus in the Cell/B.E. connecting the PPE and the SPEs

GProf: The GNU Profiler; application which profiles applications

HPC: High performance computing; comprises parallel applications on supercomputers or computer clusters

Linux: Free open-source operating system modeled after Unix

PowerPC: A family of RISC processors designed by Apple, IBM, and Motorola

PPE: Power Processor Element; PowerPC processor core with some extensions

RAM: Random access memory; most common type of computer memory which is used by applications to perform essential tasks

Sony PlayStation 3: Video game console released in fall 2006

SPE: Synergistic Processing Element; specialized vector processor found in the Cell/B.E.

1. Requirement Specifications This section contains the problem/need statement, concept sketch, system block diagram, system description, operating environment, user interface description, functional requirements, non-functional requirements, market/literature survey, and the deliverables.

Section I.1 Problem/Need Statement The advent of high performance supercomputers in the 1980s allowed researchers in biology, physics, engineering, and mathematics to tackle increasingly complex problems. However, until recently, the computer systems needed to do such research were extremely expensive and remained out of reach of all but the best-funded researchers and institutions. The significant technological advancements of the last decade have put high performance computing within the reach of even modest budgets.

Biological researchers face ever-increasing computation times because the volume of data to be processed is growing exponentially. Currently, commodity computing hardware is unable to provide adequate performance. However, the team believes that the Cell Broadband Engine (Cell/B.E.) found in the Sony PlayStation 3 (PS3) will be able to achieve superior performance at a similar cost.

Constraints do not seem to be much of a factor for this project. This project does not require any budget because the PlayStation 3 is provided by the client and the software to be ported will be open source software. A list of constraint considerations is provided below.
• Only one PS3 is available for the group to use.
• The simulator provided is not useful for speedup comparisons.
• Only one book on bioinformatics algorithms is available.

The team will take the following approach to port the application to the Cell/B.E.

1.1.1 Familiarization with Programming on the Cell/B.E. Learning how to program on the Cell/B.E. is the first step in completing this project. This will include learning how to utilize the SPEs, and how the PPE and the SPEs coordinate with one another. The group will complete some of the labs provided on an M.I.T. website, which will further aid in learning how to program the Cell/B.E.

1.1.2 Determining Which Application to Port to the Cell/B.E. The group must choose an application to port to the Cell/B.E. The group will research what work has already been done in porting these applications to ensure that it is not working on a project that has already been started or finished by others working in the field.

1.1.3 Familiarization with the Application and Algorithms In order to port the application to the Cell/B.E., the group must first understand the original application. This will include reading the source code and learning about the underlying algorithms implemented in the application. A book on algorithms provided by the faculty advisor will aid in the process of learning these algorithms and will be studied by all members of the group.

1.1.4 Porting the Application to the Cell/B.E. This is the actual task of modifying and re-compiling the application to run on the Cell/B.E.

Section I.2 Proposed Solution The team believes the Cell/B.E. found in the PlayStation 3 can provide exceptional performance when compared to traditional computers at a similar price point. The team will port an application from the BioPerf benchmark suite to the Cell/B.E.

Section I.3 Concept Sketch The team is working on porting a bioinformatics application from the BioPerf benchmark suite to the Cell/B.E. processor. The Cell/B.E. processor is well suited to provide high performance in this computationally intensive application. The team is not creating anything new; instead, the team is making an existing application perform better.

Section I.4 System Description The Cell/B.E. is comprised of a PPE and up to eight SPEs. The PPE is responsible for running the operating system and coordinating the SPEs. Each SPE is an independent vector processor capable of doing four operations at once. The PPE and SPEs are connected by a high speed bus called the Element Interconnect Bus (EIB).

The system block diagram below shows the same data being input into the same application. The application on the top is not being run on a PlayStation 3, while the application on the bottom has been ported to take advantage of the Cell/B.E. found in the PlayStation 3.

Figure 1: System Block Diagram

Section I.5 Operating Environment The ported application will execute on the Linux operating system running on the Sony PlayStation 3. The PlayStation 3 will be stored in a dry and temperature controlled environment.

Section I.6 User Interface Description The existing applications are all command line based. The team has no reason to change the interfaces.

Section I.7 Functional Requirements Below is a list of functional requirements for the ported application.
• The ported application shall run on the Cell/B.E.
• The ported application shall return the same results as the original application.
• The ported application shall return its running time for comparison to the original application.

Section I.8 Non-functional Requirements Below is a list of non-functional requirements for the ported application.
• The ported application shall run faster with the SPEs than without.
• The user interface will not be altered.

Section I.9 Market/Literature Survey There are not any paying consumers for the team’s deliverables. However, there are researchers that would be interested in using the ported application to reduce the time they spend computing results.

A group of researchers has previously ported parts of the BioPerf suite to the Cell/B.E. [Sachdeva]. The team’s work will be similar to that of Sachdeva. However, the team will port an application that was not ported by Sachdeva.

Section I.10 Deliverables The team will deliver the source code of the ported application and benchmarks comparing the ported and un-ported application. Also, the team will deliver the project plan, design document, poster, website, and presentations.

Article II. Project Plan This section contains the work breakdown, resource requirements, and the project schedule.

Section II.1 Work Breakdown Structure This section contains the work breakdown structure.

Figure 2: Work Breakdown Structure

Task No. 1 – Problem Definition Task Objective: To determine the scope of the project considered, and to decide what is to be done in terms of speed comparisons and benchmarking.

Task Approach: Meeting with the client and faculty advisor. Researching the BioPerf website and technical websites to determine what work is already in progress/completed in porting the application to the Cell/B.E.

Task Expected Results: Which application to port and which other processors to compare run-times with.

Subtask 1a – Researching the Programming of the Cell/B.E. Subtask Objective: The team needs to become familiar with programming on the Cell/B.E., the libraries available for the Cell/B.E. and how the SPEs can be used in a useful and efficient way.

Subtask Approach: Studying the M.I.T. lecture slides provided by the faculty advisor and completing various labs offered on the M.I.T. lecture slides. Researching other technical sites which focus on programming on the Cell/B.E.

Subtask Expected Results: A better understanding of how to program the Cell/B.E., as well as how it works and how to use the SPEs in a useful way.

Subtask 1b – Research of BioPerf Suite Subtask Objective: To become familiar with the BioPerf suite of applications and to aid in the decision of what application to port.

Subtask Approach: Reading documentation provided with the applications in the BioPerf suite, as well as determining which applications have already been ported or are in the process of being ported to the Cell/B.E.

Subtask Expected Results: Aid in the decision of which application to port to the Cell/B.E. Understanding the application that is to be ported and how the Cell/B.E. may be able to reduce the run-time of the application with proper use of the SPEs in the Cell/B.E.

Subtask 1c – Research Parallel Algorithms Subtask Objective: Gain a better understanding of the underlying algorithms utilized in bioinformatics applications.

Subtask Approach: Read and understand the algorithms book provided by the faculty advisor.

Subtask Expected Results: An understanding of the algorithms present in bioinformatics applications sufficient to allow the team to port one of the applications to the Cell/B.E.

Subtask 1d – Research Existing Results Subtask Objective: To identify applications already ported to the Cell/B.E. so redundant work is avoided.

Subtask Approach: Studying technical papers on the Cell/B.E. in order to identify any porting already in progress.

Subtask Expected Results: Knowledge of previous work done in porting the applications to the Cell/B.E. which will aid in the decision of which application the group will port.

Task 2 – Technology and Implementation Considerations and Selection Task Objective: Find the software that is best suited for the project.

Task Approach: This will be broken up into two main parts. Finding the best suited Linux distribution for the team’s task, and determining which bioinformatics application to port.

Task Expected Results: A pairing of Linux distribution and application to port.

Subtask 2a – Decide Which Linux to Install Subtask Objective: Find the distribution of Linux that will be the most pertinent to install.

Subtask Approach: Evaluate different distributions, such as Ubuntu, Red Hat Enterprise Linux, Fedora Core, or Yellow Dog Linux, for their suitability for the task. Considerations include ease of installation, the team’s prior knowledge of the different Linux distributions and, importantly, which have the greatest support for the bioinformatics software the team will be optimizing for the Cell/B.E.

Subtask Expected Results: A distribution of Linux chosen that best meets the team’s needs.

Subtask 2b – Decide Which Specific Application to Port Subtask Objective: Using its prior research of the BioPerf suite, the team will evaluate each application in the suite for its suitability for porting.

Subtask Approach: The team will find applications that have many elements that the team feels would be efficient to parallelize on the Cell/B.E., and that would be possible for a team of four to accomplish. The team will find these by analyzing the code base of the individual applications, as well as finding algorithms that are parallelizable or uniquely suited for the Cell/B.E.

Subtask Expected Results: An application hand selected for porting.

Task 3 – End-Product Design Task Objective: The team will have a design for the end-product.

Task Approach: The end-product design process will contain three parts: the design requirements, design process, and design documentation.

Task Expected Results: The team will have a design for the end-product.

Subtask 3a – Design Requirements Subtask Objective: The team will understand the end-product design requirements.

Subtask Approach: The group will consult with the faculty advisor and perform research to determine the end-product requirements. The team will base the discussion on the requirements outlined above.

Subtask Expected Results: The group will have a mastery of the end-product requirements and the process that will be used to create the end-product.

Subtask 3b – Design Process Subtask Objective: The team will identify the process to successfully create the end-product design.

Subtask Approach: The team will consult with the faculty advisor and conduct research to determine the best processes to use to implement the end-product design.

Subtask Expected Results: The team will know the process to be used to implement the end-product design.

Subtask 3c – Design Documentation Subtask Objective: The team will have records of the methods to be used in the end-product design.

Subtask Approach: Each team member will record the design in a notebook or digital format, and all emails, meeting minutes, and other communications will be saved so they can be retrieved by the team or another interested party if needed.

Subtask Expected Results: The team will have thorough records of the end-product design process.

Task 4 – End-product Prototype Implementation Task Objective: An implemented end-product prototype running on the Cell/B.E.

Task Approach: Part of the BioPerf suite will be compiled for the Cell/B.E. It will be tested on a simulator, and once the simulation is found to be working correctly, it will be executed on the Cell/B.E.

Task Expected Results: A working version of BioPerf software running on the Cell/B.E.

Task 5 – End-Product Testing Task Objective: Test the implementation of the chosen ports for correctness and speed.

Task Approach: This will be broken up into two parts. The correctness of the final output data, and the degree of performance enhancement.

Task Expected Results: Benchmarks usable for comparisons as well as an application that works correctly.

Subtask 5a – Ensure Consistency of Output Data Subtask Objective: Ensure that the application outputs correct data.

Subtask Approach: This will be accomplished by comparing the output on known datasets to known correct results.

Subtask Expected Results: An application that outputs correct data.

Subtask 5b – Benchmarking Subtask Objective: Compare the speed of the application on the Cell/B.E. to its speed on multi-core or single-core systems.

Subtask Approach: Do timing analysis on the Cell/B.E. of the application the team ported and then do the same on single and multi-core machines.

Subtask Expected Results: Benchmarks that clearly show what improvements were made by porting to the Cell/B.E.

Section II.2 Resources This section gives detailed information about the resources needed to complete this project. Included are detailed time projections for each team member as well as projected costs for materials and labor.

(a) Organizational Chart This section contains the organization chart for the project. Dr. Smith is the course coordinator. Zhao Zhang is the project advisor. The team members are Bryan Venteicher, Matt Rohlf, Shannon McCormick, and Kyle Byerly.

Dr. Smith (Course Coordinator), Zhao Zhang (Project Advisor)
Bryan Venteicher (Project Leader)
Shannon McCormick, Matt Rohlf, Kyle Byerly (this row of the chart also carries the label Communication Coordinator)
Figure 3: Organization Chart

(b) Personnel This table shows the amount of time each team member is expected to work on each project task.

Team Member        Task 1  Task 2  Task 3  Task 4  Task 5  Task 6  Task 7  Task 8  Totals
Kyle Byerly            25      10       5      80      15      10      12      35     192
Shannon McCormick      25       8       5      80      15      10      12      35     190
Matt Rohlf             25      10       5      80      15      10      12      35     192
Bryan Venteicher       25       8       5      80      15      10      12      35     190
Totals                100      36      20     320      60      40      48     140     764

Table 1: Personnel Resources

This table shows the amount of time each team member spent on the project during the Fall 2007 semester, excluding the first three weeks of the semester.

Team Member        Hours
Kyle Byerly         76.5
Shannon McCormick   80
Matt Rohlf          68.5
Bryan Venteicher    64.5
Total              289.5
Table 2: Hours Fall 2007 semester, excluding first three weeks

(c) Materials This table shows the estimated costs for each item needed to complete the project.

Item                  Team Hours  Other Hours  Cost
Parts and Materials:
a. Algorithms Book             0            0  $100
b. Algorithms Book             0            0  Donated
c. PlayStation 3               0            0  Donated
Totals                         0            0  $100
Table 3: Materials Resources

(d) Financial This table shows the financial resources needed to complete the project.

Item                      W/O Labor  With Labor
Parts and Materials:
a. Algorithms Book          $100.00     $100.00
b. Algorithms Book          Donated     Donated
c. PlayStation 3            Donated     Donated
Subtotal                    $100.00     $100.00
Labor at $10.00 per hour:
a. Kyle Byerly                $0.00    $1920.00
b. Shannon McCormick          $0.00    $1900.00
c. Matt Rohlf                 $0.00    $1920.00
d. Bryan Venteicher           $0.00    $1900.00
Subtotal                      $0.00    $7640.00
Total                       $100.00    $7740.00
Table 4: Financial Resources

Section II.3 Project Schedule This section contains Gantt charts that give a detailed schedule for the project. The Gantt charts list the expected start and end dates for each subtask.

(a) Project Gantt Chart This section is composed of a Gantt chart that details all tasks and sub-tasks for this project.

Figure 4: Project Schedule Gantt Chart

(b) Deliverable Schedule Gantt Chart This section contains a Gantt chart which shows only the project deliverables.

Figure 5: Project Deliverables Gantt Chart

Article III. Project Design This section contains the design method, engineering specifications and the software design documents.

Section III.1 Design Method and Engineering Specifications This section contains the design method, and the input, output, user interface, hardware, software, and test specifications for the project.

(a) Design Method The unique architecture of the Cell/B.E. makes porting applications to it a challenge. The three porting strategies identified by the team are described below. Generally, the more involved the strategy, the larger the opportunity for a performance increase.

The IBM-provided XLC compiler is able to generate code for both the PPE and the SPEs without any modifications by the programmer. This means that the same source code that was compiled on a typical personal computer could be compiled on the Cell/B.E. and that the Cell/B.E. executable would take advantage of the SPEs. The XLC compiler analyzes the source code to determine which portions would be suitable for execution on the SPEs. While this method places virtually no burden on the programmer, the typical performance improvement is quite small. This is due to the immaturity of the Cell/B.E. compiler and the difficulty in determining which blocks of code are suitable for execution on the SPEs.

The next two methods involve the programmer making changes to the application source code. Such modifications are time consuming but can lead to significant performance improvements. In the first method, the functional logic of the application is kept intact, but sections of the application are broken apart so that they can run in threads on the SPEs. Thus, at least some of the application’s computation load is distributed across the available SPEs. Moreover, the programmer does not modify the application code that runs on the SPEs. Rather, that task is left to the compiler.

In the final strategy, functional changes to the application code are done by the programmer. Not only is the application broken up into threads which execute on the SPEs, the application code in each thread is rewritten to exploit the Cell/B.E. architecture. The source code must be vectorized in order to achieve significant performance improvements. However, this is a time consuming process, and there is a potential to introduce bugs into the vectorized program logic.
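The second strategy (keep the logic intact, split the work into threads) can be illustrated with a minimal sketch. It uses ordinary Python threads as a stand-in for SPE threads; the `score` kernel, the worker count constant, and the helper names are hypothetical and are not taken from any BioPerf application.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 6  # assumption: one worker per SPE available on the PS3


def score(item):
    """Hypothetical per-item kernel; its logic is left unchanged by the split."""
    return item * item


def split(items, parts):
    """Cut items into at most `parts` contiguous chunks of near-equal size."""
    if not items:
        return []
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_partitioned(items, workers=NUM_WORKERS):
    """Evaluate the kernel chunk by chunk on a pool of worker threads."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: sum(score(x) for x in chunk), chunks)
        return sum(partials)


total = run_partitioned(list(range(100)))
```

The third strategy would additionally rewrite `score` itself so that it processes several values per instruction, which is where most of the added effort and risk of bugs comes from.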

The team expects to profile the un-ported applications in the BioPerf suite to determine which functions in each application constitute a significant amount of execution time. The input to each application will be what is provided by the BioPerf suite. From the profiling data, the hot functions (the ones that use a significant amount of CPU time) will be analyzed and the team will determine which of the strategies would be suitable for porting to the Cell/B.E.
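As a rough illustration of how hot functions might be picked out of profiling data, the sketch below filters a mapping of function name to percentage of run time. The 10% threshold, the function names, and the percentages are all hypothetical and are not results from BioPerf.

```python
def hot_functions(profile, threshold_pct=10.0):
    """Return (name, percent) pairs at or above the threshold, largest first."""
    hot = [(name, pct) for name, pct in profile.items() if pct >= threshold_pct]
    return sorted(hot, key=lambda item: item[1], reverse=True)


# Hypothetical flat-profile percentages, for illustration only.
sample_profile = {
    "evaluate_tree": 62.5,
    "add_species": 21.0,
    "read_input": 3.5,
    "print_tree": 1.0,
}

hot = hot_functions(sample_profile)
coverage = sum(pct for _, pct in hot)  # share of run time the hot set accounts for
```

If a few functions account for most of the run time, as in this made-up profile, porting only those functions to the SPEs is a reasonable first target.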

(b) Input Specification The input to the system will be a bioinformatics application chosen from the BioPerf suite. In selection of the application, there are several considerations which must be made.

In order to prevent repeating already accomplished work, the group should choose an application from the BioPerf suite which has not already been ported to run on the Cell/B.E.

The datasets for the application must fit into the memory space of the Cell/B.E. This is rather limiting due to the fact that many bioinformatics applications utilize large working sets of data. There has been previous work done in porting FASTA and HMMER, two of the applications from the BioPerf suite, where the researchers discovered that the size of the working sets exceeded the 256 KB local storage capacity on the SPEs.
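To make the local storage constraint concrete, the following sketch estimates how a working set larger than 256 KB could be processed in pieces. The amount reserved for code and stack (96 KB) and the use of two buffers are assumptions chosen for illustration, not measured values.

```python
import math

LOCAL_STORE_BYTES = 256 * 1024  # local storage per SPE, as stated in the text


def plan_chunks(dataset_bytes, reserved_bytes=96 * 1024, buffers=2):
    """Return (chunk_size, chunk_count) for streaming a dataset through one SPE.

    reserved_bytes is an assumed allowance for program code and stack.
    buffers=2 models double buffering: one buffer is processed while the
    next one is being filled.
    """
    usable = LOCAL_STORE_BYTES - reserved_bytes
    chunk = usable // buffers
    if chunk <= 0:
        raise ValueError("no local storage left for data buffers")
    return chunk, math.ceil(dataset_bytes / chunk)


chunk_size, chunk_count = plan_chunks(1024 * 1024)  # a 1 MB working set
```

Under these assumptions a 1 MB working set needs 13 transfers of 80 KB each, which shows why applications with large, randomly accessed working sets are hard to fit.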

The application must also be parallelizable. Some applications cannot usefully be run in parallel because they contain a large number of data dependencies, which slow parallel execution because each step must constantly wait for earlier computations. Currently, the group is conducting research into which characteristics and criteria should be present in the applications considered for selection for this project.

(c) Output Specification The ported application from the BioPerf suite will be compared with the original application using execution time. Profiling data from GProf will be used to show the speedup of specific functions in the ported application. The output from the group’s parallelized version of the code will be compared to the original output to ensure correctness.
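One way such a correctness check could be automated is sketched below using Python's standard difflib module. The function name and the sample strings are hypothetical; in practice the two texts would be read from the output files of the original and ported runs.

```python
import difflib


def output_differences(reference_text, ported_text):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(
        difflib.unified_diff(
            reference_text.splitlines(),
            ported_text.splitlines(),
            fromfile="original",
            tofile="ported",
            lineterm="",
        )
    )


# Hypothetical outputs, for illustration only.
same = output_differences("tree 1\nsteps 42\n", "tree 1\nsteps 42\n")
changed = output_differences("tree 1\nsteps 42\n", "tree 1\nsteps 43\n")
```

An empty result for every test input is the pass condition; any non-empty result pinpoints the lines where the ported application diverges.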

(d) User Interface Specification The team will not alter the existing user interface provided with BioPerf. The BioPerf user interface consists of a series of shell scripts which provide a menu from which the user selects the applications to benchmark. An example of the user interface is shown below.

$ sh use-bioperf.sh
You can do the following:
[R] Run BioPerf
[I] Install BioPerf on your architecture (if your architecture is not PowerPc, x86)
[C] Clean outputs in /home/seniord/BioPerf.profile/Outputs
[D] Display all versions of the installed codes

(e) Hardware Specification Target hardware platform: Sony PlayStation 3 The PlayStation 3 utilizes a Cell/B.E. processor, which is a parallel processor jointly developed by Sony, Toshiba, and IBM. The Cell/B.E. in the PS3 is comprised of a single PPE and six available SPEs.

The PPE is a standard two-way multithreaded 64-bit Power-based CPU, and acts as a controller for the SPE cores. It operates at 3.2 GHz. The PPE contains a 32 KB instruction and 32 KB data L1 cache, as well as a 512 KB L2 cache. The PPE also contains an AltiVec vector processing unit for single-precision vector arithmetic, and its scalar floating point unit can complete two double-precision floating point operations per clock cycle.

The SPEs act as the workhorses of the Cell/B.E. They run at the same 3.2 GHz as the PPE, and are capable of performing four 32-bit integer operations or four single-precision floating point operations per clock cycle. Each SPE also contains 256 KB of embedded RAM that acts as a “Local Storage” for the SPE and can be accessed by both its SPE and the PPE. The Memory Flow Controller, or MFC, is responsible for transferring data to and from the LS.
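The figures above imply a simple upper bound on SPE throughput, worked through in the sketch below. It counts operations exactly as the text does (four per SPE per clock cycle) and ignores memory transfers, so it is an idealized ceiling rather than an expected result.

```python
CLOCK_HZ = 3.2e9          # SPE clock rate from the text
OPS_PER_CYCLE = 4         # four single-precision (or 32-bit integer) operations
AVAILABLE_SPES = 6        # SPEs available on the PlayStation 3


def peak_ops_per_second(spes=AVAILABLE_SPES):
    """Idealized peak: every SPE issues a full four-wide operation each cycle."""
    return spes * OPS_PER_CYCLE * CLOCK_HZ


one_spe = peak_ops_per_second(1)   # 12.8 billion operations per second
all_spes = peak_ops_per_second()   # 76.8 billion operations per second
```

Code that is not vectorized uses only one of the four lanes, which is one reason the vectorizing strategy offers the largest potential gain.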

In addition, the PlayStation 3 is outfitted with 256 MB of XDR DRAM, clocked at the CPU’s die speed. This may be a noticeable obstacle, since the PS3 has significantly less RAM than a typical modern desktop PC.

The PlayStation 3 also has a 60 GB 5400RPM SATA hard disk and a gigabit Ethernet port.

The team’s comparison hardware platform will be a server. It contains a quad-core Intel Xeon x86 processor running at 3.0 GHz with a 512 KB cache. The comparison PC also contains 1 GB of RAM, a sufficiently large hard disk, and a gigabit Ethernet card. The team will use this machine unless a more modern and/or faster machine can be acquired.

(f) Software Specification The team will use the Fedora Linux distribution installed on the PlayStation 3. At the time of writing, Fedora 7 is available and recommended for use. The Linux kernel version used is 2.6.23.1. The IBM SDK for the Cell/B.E. is currently at version 3.0. This includes version 9.0 of the XLC compiler. Version 4.1.1 of the GNU C compiler with support for the Cell/B.E. is also installed. The GNU profiler version used is 2.17.5.

Since the Cell/B.E. is relatively new, updated versions of applications are frequently released. The team expects to keep up to date with the applications in order to get bug fixes and performance improvements.

(g) Test Specification Parallelizing the BioPerf applications should lead to an overall speedup in the execution of the application. This will be verified by performing several tests upon completion of the project.

The ported application will be tested running only on the PPE, then on the PPE with one SPE, next with the PPE and three SPEs, and then on the PPE and utilizing all six SPEs available on the PlayStation 3.

The group will mainly gather data on runtimes of the specific tests. Comparisons in runtimes will be made with data collected from tests on the Cell/B.E. with data from other more generic processors. In these comparisons the group hopes to see a noticeable reduction in runtime for the application on the Cell/B.E. compared to those from other processors.
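The comparison across configurations can be summarized as speedup and parallel efficiency. The sketch below shows the arithmetic; the run times are made-up placeholders, not measurements from this project.

```python
def speedup_table(baseline_seconds, timings):
    """Return (spe_count, speedup, efficiency) rows, rounded to two decimals.

    speedup    = PPE-only run time / run time with n SPEs
    efficiency = speedup / n
    """
    rows = []
    for spes, seconds in sorted(timings.items()):
        speedup = baseline_seconds / seconds
        rows.append((spes, round(speedup, 2), round(speedup / spes, 2)))
    return rows


# Hypothetical run times in seconds for 1, 3, and 6 SPEs.
ppe_only_seconds = 120.0
rows = speedup_table(ppe_only_seconds, {1: 150.0, 3: 60.0, 6: 40.0})
```

In this made-up data a single SPE is slower than the PPE alone (speedup below 1), which can happen when the cost of moving data to the SPE outweighs the work it takes over.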

3.1 Design Review Issues This section contains issues raised during the design review session and the team’s answers.

• Use IBM’s XLC compiler instead of gcc
The performance of code generated by gcc was once inferior to that produced by XLC, but that gap has narrowed. The team is much more familiar with gcc. The build environment used by the team makes it easy to switch compilers, so the team can switch to XLC if needed or desired without any significant work.

• Investigate the Minebench series papers
The team has skimmed through two Minebench papers found on the BioPerf website (J. Zambreno et al., R. Narayana et al.) and does not believe the papers are applicable to the project. The papers neither go into detail about the porting process nor give much detail on how the performance profiling was done.

• Single and double precision floating point numbers
The SPEs are not able to handle double precision floating point numbers without taking a significant performance hit. The team does not plan to execute any double precision floating point code on the SPEs. Instead, if the selected applications contain double precision code, those sections will be executed on the PPE.

• Insufficient memory on the PlayStation 3
The PS3 only has 256 MB of RAM available for the operating system to use. This will limit the applications that can effectively be ported to the PS3 and the input data that can be used. Input data sets that are too large will cause excessive page faults, which would make comparisons to other systems impossible, even systems with the same amount of RAM.

• Offload some computation from a PC to the PlayStation 3
The team believes this is outside the scope of the project. The team has also identified some difficulties in this approach. First, the Linux device driver for the Ethernet interface has several known issues, including randomly dropping packets when resources are not limited. Second, the latency of such a system would be quite high, which would limit performance improvements. Finally, the RAM limitation mentioned above would continue to be a limiting factor.

• Contact professors with expertise in bioinformatics
The team will attempt to identify and contact professors in bioinformatics fields to answer questions and provide guidance.

• How will the team check whether the ported application’s output is correct?
The applications produce textual output, which can quickly be compared with the UNIX utility diff.

Article IV. Implementation This section covers the selection of the program to be ported, an explanation of the ported program, previous work on parallelizing the selected application, possible approaches to parallelizing an application on the Cell/B.E., the educational ClustalW port the team completed, and a parallelized version and a Cell/B.E. version of DNAPenny.

Section IV.1 Ported Program Selection

The team profiled each of the applications included in the BioPerf suite with the GNU Profiler, GProf. Each application was profiled using the input data included with BioPerf. The team then analyzed the applications and profiling results against the following criteria to determine which programs would be best suited to port:

• The application had not previously been ported to the Cell/B.E.
• The application’s source code could be understood within the project timeframe. The team chose 10,000 lines of code as the maximum the team would be able to understand in one semester.
• The application’s algorithm was documented in academic papers, published on the web, or included in the application’s documentation.
• One of the following conditions was satisfied:
  o The overall application algorithm was suitable to be parallelized.
  o A significant percentage of the application runtime was contained in a few functions, and those functions were suitable to be parallelized.

Both ClustalW and DNAPenny met the lines of code and parallelization criteria. However, ClustalW had already been ported to the Cell/B.E. [Sachdeva]. Therefore, the team decided to port DNAPenny to the Cell/B.E.

Section IV.2 Explanation of DNAPenny

DNAPenny is used to find the most parsimonious phylogenetic tree (or trees) suggested for a set of species based on a subset of their aligned DNA sequences. In other words, it identifies the most likely evolutionary relationship among the given species, defined as the relationship requiring the fewest evolutionary changes. The task of finding the correct phylogenetic trees is NP-Complete [Felsenstein], so DNAPenny employs a branch and bound algorithm to limit the trees needing processing to the most likely. Despite the use of the branch and bound method, DNAPenny is still quite slow, and it becomes much slower as more species are added to the input set because of the number of potential trees that need to be processed.

DNAPenny’s search process initially attempts to generate every possible tree based on the input data. As it generates each tree, it calculates the number of base substitutions and estimates the minimum number of substitutions needed as more species are added. The branch and bound algorithm is a method of avoiding the generation of trees if they are estimated to require more base substitutions than a previously found tree. Because of this algorithm, DNAPenny can find the most parsimonious trees without needing to search every possibility.
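The pruning idea can be illustrated with a minimal Python sketch. It is not DNAPenny's code: the function names (`fitch`, `insertions`, `most_parsimonious`), the rooted nested-tuple tree representation, and the use of Fitch's counting rule for base substitutions are assumptions chosen for brevity. The bound here is simply the substitution count of the partial tree, which can never decrease as species are added, rather than the estimate DNAPenny computes.

```python
# Illustrative only: a toy branch and bound search, not DNAPenny's implementation.
BASE_BITS = {"A": 1, "C": 2, "G": 4, "T": 8}


def fitch(tree, seqs):
    """Return (per-site state sets, substitution count) for a rooted binary tree.

    A tree is either a species index (leaf) or a pair (left, right).
    """
    if isinstance(tree, int):
        return [BASE_BITS[base] for base in seqs[tree]], 0
    left_sets, left_cost = fitch(tree[0], seqs)
    right_sets, right_cost = fitch(tree[1], seqs)
    sets, cost = [], left_cost + right_cost
    for a, b in zip(left_sets, right_sets):
        common = a & b
        if common:
            sets.append(common)   # children agree on at least one base
        else:
            sets.append(a | b)    # disagreement costs one substitution
            cost += 1
    return sets, cost


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)
    if not isinstance(tree, int):
        for sub in insertions(tree[0], leaf):
            yield (sub, tree[1])
        for sub in insertions(tree[1], leaf):
            yield (tree[0], sub)


def most_parsimonious(seqs):
    """Return (minimum substitutions, list of trees achieving it)."""
    best = {"cost": float("inf"), "trees": []}

    def extend(tree, next_leaf):
        cost = fitch(tree, seqs)[1]
        if cost > best["cost"]:
            return  # bound: this partial tree is already worse than a full tree
        if next_leaf == len(seqs):
            if cost < best["cost"]:
                best["cost"], best["trees"] = cost, [tree]
            else:
                best["trees"].append(tree)
            return
        for candidate in insertions(tree, next_leaf):
            extend(candidate, next_leaf + 1)

    extend((0, 1), 2)  # start from the first two species
    return best["cost"], best["trees"]


cost, trees = most_parsimonious(["AA", "AA", "TT", "TT"])
print(cost, len(trees))
```

Because the sketch enumerates rooted trees, several of the returned trees describe the same unrooted relationship; only the minimum substitution count is meaningful here.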

Section IV.3 Previous Parallelization of DNAPenny

The team did not locate any previous attempts to parallelize DNAPenny on the Cell/B.E. During its research, however, the team did discover previous work on parallelizing DNAPenny. Ramarao Desaraju [Desaraju] discusses his 2005 work on parallelizing DNAPenny on computer clusters. Desaraju uses the Message Passing Interface (MPI) to distribute the task of finding the most parsimonious tree over a set of nodes in a computer cluster. This method, however, is not directly applicable to the team’s work on the Cell/B.E. because the team has only one PlayStation 3.

Section IV.4 Cell/B.E. Parallelization Models

The Cell/B.E. literature identifies two general models for parallelizing applications on the processor: PPE-centric and SPE-centric. In the PPE-centric model, most of the application code runs on the PPE, and small, individual functions are offloaded onto the SPEs. The PPE instructs the SPEs to begin processing the data and then waits until the SPEs notify it that they are complete. The PPE then resumes execution and processes the results of the SPEs’ computation. In the SPE-centric model, most of the application code runs on the SPEs, and the PPE does only limited coordination of the SPEs.

The PPE-centric model is further broken down into sub-models: the function offload, multistage pipeline, and parallel stages models. In the function offload model, the SPEs are used to accelerate performance critical functions. In the multistage pipeline model, the functional work is broken up into at least two stages, and each stage is assigned to a specific SPE. The intermediate results are pipelined from one SPE to the next one in the pipeline, repeating until processing is complete. This approach is suitable when the overall task can be divided into data dependent tasks. The parallel stages model is a parallel version of the multistage pipeline model. This approach is suitable when the conditions for the multistage pipeline are met and there are multiple streams of data needing processing. The model allows more of the SPEs to be utilized if fewer stages than SPEs are identified.

Section IV.5 Libspe2

IBM provides a library called libspe2 to assist programmers developing applications on the Cell/B.E. The library abstracts away the operating system interactions between the PPE and the SPEs. The library presents to the applications a virtualized SPE called an SPE context. Applications do not have direct control over the SPEs. The contexts are scheduled to run on the SPEs by the operating system.

The application makes a system call to inform the operating system that a specific context is to be scheduled on the available SPEs. The thread making the system call is blocked for the entire duration the context is executing code. In order to use multiple SPEs concurrently, the application must create multiple threads and a separate context for each thread.
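The one-thread-per-context pattern described above can be sketched as follows. This is a Python analogue rather than libspe2 code: `run_context` is a hypothetical stand-in for the blocking call that runs an SPE context (in libspe2 that role is played by `spe_context_run`), and the summation is only a placeholder for real SPE work.

```python
import threading


def run_context(context_id, work, results):
    """Hypothetical stand-in for a blocking 'run this SPE context' call."""
    results[context_id] = sum(work)  # placeholder for the offloaded computation


def run_on_contexts(chunks):
    """Run one blocking context per thread so the contexts execute concurrently."""
    results = [None] * len(chunks)
    threads = [
        threading.Thread(target=run_context, args=(i, chunk, results))
        for i, chunk in enumerate(chunks)
    ]
    for thread in threads:
        thread.start()   # each thread blocks inside its own context run
    for thread in threads:
        thread.join()    # the main thread waits until every context finishes
    return results


print(run_on_contexts([[1, 2], [3], [4, 5, 6]]))
```

Each worker thread is blocked for as long as its context runs, which is why a separate thread is needed per context if several SPEs are to be busy at once.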

The library also provides functionality for the SPE code. Functions are provided which simplify DMA transfers and communicating with the PPE.

Section IV.6 ClustalW Cell/B.E. Prototype

Although the ClustalW application had already been ported to the Cell/B.E., the team felt it would be educational to complete a prototype port of the application. This would allow the team to gain experience with the issues associated with porting applications, while knowing that a port of the application to the Cell/B.E. was feasible. The BioPerf suite included an already parallelized version of ClustalW called ClustalW_smp. The team used ClustalW_smp as a starting point for the porting work in order to concentrate solely on porting to the Cell/B.E. and to reduce the time spent on the port.

In order to reuse as much of the existing ClustalW_smp code as possible, the team used the PPE-centric model. Attempting to use the SPE-centric model would have meant a significant rewrite of the existing code. The team selected the function offload model as the most suitable PPE-centric model to use. ClustalW_smp uses a set of threads to divide the computational work of the algorithm among multiple processors. Each thread does a small amount of initialization, but spends most of its time doing computationally intensive tasks. These computationally intensive portions would be offloaded onto the SPEs. Furthermore, the pipeline models were not well suited because the computational portions had data dependencies on the previous iteration. The porting process was broken into several steps.

First, the build environment of the application was converted to one suitable for building Cell/B.E. applications. Specifically, a specialized compiler and linker must be used. Then, an initial version was created which offloaded the computationally intensive portion of the threads onto one SPE. For simplicity, the SPE context was recreated and reloaded before every iteration. This was addressed in the next version, in which the SPE context was created and loaded once for each worker thread. The SPE code would block until the PPE notified it that there was additional data to process. The PPE thread would block until it was notified by the SPE that the computation was complete and the results had been transferred back into main memory. The application was tested with the limited input data that was provided with the BioPerf suite to ensure the results were not altered.

The team did not spend any time optimizing or benchmarking their ported version of ClustalW. The port was done solely to gain a better understanding of the issues associated with porting an application to the Cell/B.E. and to help the team gauge how long a port of DNAPenny would take.

3.1 DNAPenny on Cell/B.E.

Once the team had profiling information for DNAPenny, it became clear that most of the execution time of the program was spent in one function. This function turned out to be a great candidate for parallelization because it mainly consisted of a large loop which showed no data dependencies between iterations. The large loop contained two nested loops, with the middle loop having a data dependency for determining when to exit the loop. The team felt that they would get the best speedup if the input to the function was broken up and sections of the input were processed simultaneously.

In the initial version of the ported DNAPenny, the entire input to the function was loaded into an SPE context and sent to one of the SPEs. This slowed down the execution of the program because of the overhead of creating the context and transferring the data to the SPE. Once this was working correctly, the code was altered to create a variable number of SPE contexts, allowing one context per available SPE. Every SPE context ran in its own thread, and the input to the function was divided as evenly as possible among the threads.
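Dividing the input "as evenly as possible" can be sketched as below. This is an illustrative Python helper, not the team's code; the name `split_evenly` and the choice to give the leftover elements to the first workers are assumptions.

```python
def split_evenly(n_items, n_workers):
    """Return (start, end) index ranges whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        # The first `extra` workers each take one leftover item.
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


print(split_evenly(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

With this scheme no worker receives more than one element beyond any other, so the slowest worker bounds the total time as tightly as a static split allows.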

SPE events were used so that the PPE and SPEs could synchronize correctly. Once the main thread of the program sent the parallel threads to the SPEs, the PPE thread would block until it received notification that each of the SPEs had finished its execution. Once this notification was received, the main thread would resume execution. Parallel DMA transfers from the PPE to the SPEs were implemented in an attempt to reduce the overhead of performing one transfer at a time, but there was no observable performance improvement, likely because of the small amount of data that needed to be transferred.

3.2 Parallelized DNAPenny

The team created a parallelized version of DNAPenny which could be run on typical personal computers. This version of DNAPenny is parallelized in a similar manner to the Cell/B.E. version. The input data is divided as evenly as possible among the available worker threads. The main thread waits for all of the processing threads to notify it that they are finished. The team created this version of DNAPenny to provide a better comparison for the parallelized version on the Cell/B.E.

3.3 Vectorized Parallel DNAPenny on Cell/B.E.

A significant advantage of the Cell/B.E. is its support for single instruction, multiple data (SIMD) instructions. With SIMD, the same operation, for example multiplication, can be performed on multiple data elements by issuing a single instruction. Since the registers in the SPU are 128-bit and the application uses 32-bit integers, four integers can be manipulated in one instruction. In most cases, the compiler is unable to analyze the code well enough to take full advantage of SIMD because it cannot safely determine all the data dependencies. Therefore, the team manually rewrote portions of the code executing on the SPE to take advantage of SIMD by using intrinsics provided by the compiler.

Intrinsics are macro-like facilities provided by the compiler that map to specific SIMD assembly instructions. Specifically, the loop which processes each data element was altered to process four elements in every iteration. The intrinsics were used to perform the computational operations, such as addition and logical operations, on the four elements in a single instruction.
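The four-elements-per-iteration idea can be illustrated without SPU hardware. The Python sketch below is an emulation, not the team's intrinsics code: it packs four 32-bit integers into one 128-bit integer so that a single bitwise AND acts on all four lanes at once. The helper names (`pack4`, `unpack4`, `and_vector`) are hypothetical, the input length is assumed to be a multiple of four, and only logical operations are covered because lane-wise addition would need carry handling that real SIMD hardware provides.

```python
LANES, LANE_BITS = 4, 32
LANE_MASK = (1 << LANE_BITS) - 1


def pack4(values):
    """Pack four 32-bit unsigned integers into one 128-bit integer."""
    assert len(values) == LANES
    packed = 0
    for lane, value in enumerate(values):
        packed |= (value & LANE_MASK) << (lane * LANE_BITS)
    return packed


def unpack4(packed):
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(packed >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]


def and_scalar(a, b):
    """One AND per element: the original, unvectorized loop shape."""
    return [x & y for x, y in zip(a, b)]


def and_vector(a, b):
    """One AND per group of four elements: the vectorized loop shape."""
    out = []
    for i in range(0, len(a), LANES):
        out.extend(unpack4(pack4(a[i:i + LANES]) & pack4(b[i:i + LANES])))
    return out


a = [1, 2, 4, 8, 15, 0, 7, 3]
b = [1, 4, 4, 3, 9, 5, 2, 3]
print(and_vector(a, b) == and_scalar(a, b))  # True
```

The vectorized loop runs a quarter as many iterations as the scalar one, which is the source of the gain the hand-written intrinsics aim for.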

Article V. Testing

There were two phases of testing the ported DNAPenny application. First, the team had to verify that it had not altered the original algorithm: the ported application must produce the same output as the original application. Second, the team had to determine the performance improvement of the ported application.

3.4 Test Plan

The BioPerf suite included only a single input file for DNAPenny. A single file is not sufficient to ensure the application algorithm was not altered. Therefore, the team spent considerable time obtaining additional input files. The team used DNA sequence data found at the GenBank website [GenBank] and ClustalW to generate global sequence alignments.

After the additional input files were obtained, the original application was used to generate the correct output files. These files were then committed to the team’s source code repository so that they could be compared against the output of the ported application. A shell script was created to automate the testing of the application. By automating the testing, the team spent almost no time on manual testing.
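The comparison step can be sketched as follows. The team used a shell script and the UNIX diff utility; this Python version is a substitute for illustration only, and the function name `output_differences` is hypothetical.

```python
import difflib


def output_differences(expected_text, actual_text):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(difflib.unified_diff(
        expected_text.splitlines(),
        actual_text.splitlines(),
        fromfile="expected",
        tofile="actual",
        lineterm="",
    ))


reference = "tree 1\ntree 2\n"
print(output_differences(reference, reference))  # []
```

A test run would pass only when the list is empty for every input file, mirroring a diff that produces no output.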

To obtain the benchmark results, another test was created which recorded the time the application took to process each input file. These test runs were then committed to the team’s source code repository for comparison with other systems. The test script was run on several different systems. Furthermore, when benchmarking on the Cell/B.E., the number of SPE threads was set to 1, 3, and 6 to determine how the team’s parallel implementation scales.

3.5 Test Results

The ported application produces exactly the same output as the original application for all of the input files.

The unoptimized original code took approximately 2 hours and 10 minutes to process the original input file on the PlayStation 3, while the hand parallelized, hand vectorized code ran in about 2 minutes and 10 seconds on 6 SPEs. Across the input files, the hand parallelized and hand vectorized code was up to roughly 60 times faster than the original. Six input files were chosen to cover a wide range of run times.

The tables below show the performance results for the code revisions on the team’s four-way computer and the PS3. The dnapenny_orig revision refers to the original code from the BioPerf suite. The dnapenny_slimmer revision removes unused functions and dead code from the original DNAPenny code. The parallel_dnapenny_1.0 revision refers to the parallelized version of DNAPenny that runs on normal computers. The supplement_spe_parallel_XSPE revision refers to the version which utilizes X SPEs and uses only compiler optimizations. The supplement_parallel_vector_XSPE revision refers to the version which utilizes X SPEs and has hand vector optimizations.

Table 5: infile.orig

Code revision                    4-Way 3.0GHz Machine (seconds)  X Speedup   PlayStation 3 (seconds)  X Speedup
dnapenny_orig                    823.568                         1           7793.915                 1
dnapenny_slimmer                 360.131                         2.28685673  941.981                  8.273962
parallel_dnapenny_1.0            221.432                         3.71928177  780.867                  9.9811043
supplement_spe_parallel_1SPE     N/A                             N/A         1111.471                 7.0122522
supplement_spe_parallel_3SPE     N/A                             N/A         443.521                  17.572821
supplement_spe_parallel_6SPE     N/A                             N/A         277.233                  28.11323
supplement_parallel_vector_1SPE  N/A                             N/A         260.952                  29.867236
supplement_parallel_vector_3SPE  N/A                             N/A         153.656                  50.723141
supplement_parallel_vector_6SPE  N/A                             N/A         130.59                   59.682326

Table 6: in_b17x1237.txt

Code revision                    4-Way 3.0GHz Machine (seconds)  X Speedup   PlayStation 3 (seconds)  X Speedup
dnapenny_orig                    431.5                           1           4118.808                 1
dnapenny_slimmer                 177.704                         2.4281952   492.8                    8.3579708
parallel_dnapenny_1.0            121.591                         3.54878239  530.097                  7.7699138
supplement_spe_parallel_1SPE     N/A                             N/A         585.103                  7.039458
supplement_spe_parallel_3SPE     N/A                             N/A         236.243                  17.434625
supplement_spe_parallel_6SPE     N/A                             N/A         150.73                   27.325735
supplement_parallel_vector_1SPE  N/A                             N/A         132.263                  31.141045
supplement_parallel_vector_3SPE  N/A                             N/A         78.316                   52.592165
supplement_parallel_vector_6SPE  N/A                             N/A         67.705                   60.834621

Table 7: in_b12x1236.txt

Code revision                    4-Way 3.0GHz Machine (seconds)  X Speedup   PlayStation 3 (seconds)  X Speedup
dnapenny_orig                    334.923                         1           2891.646                 1
dnapenny_slimmer                 141.241                         2.37128737  366.39                   7.8922623
parallel_dnapenny_1.0            92.848                          3.60721825  308.202                  9.3823077
supplement_spe_parallel_1SPE     N/A                             N/A         422.884                  6.8379177
supplement_spe_parallel_3SPE     N/A                             N/A         182.395                  15.853757
supplement_spe_parallel_6SPE     N/A                             N/A         122.248                  23.653933
supplement_parallel_vector_1SPE  N/A                             N/A         114.687                  25.213372
supplement_parallel_vector_3SPE  N/A                             N/A         71.271                   40.572547
supplement_parallel_vector_6SPE  N/A                             N/A         63.434                   45.585112

Table 8: in_b10x1219.txt

Code revision                    4-Way 3.0GHz Machine (seconds)  X Speedup   PlayStation 3 (seconds)  X Speedup
dnapenny_orig                    264.51                          1           2197.3                   1
dnapenny_slimmer                 109.755                         2.4100041   291.101                  7.5482393
parallel_dnapenny_1.0            78.077                          3.38780947  236.387                  9.2953504
supplement_spe_parallel_1SPE     N/A                             N/A         327.592                  6.7074288
supplement_spe_parallel_3SPE     N/A                             N/A         141.428                  15.536527
supplement_spe_parallel_6SPE     N/A                             N/A         99.551                   22.072104
supplement_parallel_vector_1SPE  N/A                             N/A         99.237                   22.141943
supplement_parallel_vector_3SPE  N/A                             N/A         63.141                   34.799892
supplement_parallel_vector_6SPE  N/A                             N/A         57.968                   37.905396

Table 9: in_a18x822.txt

Code revision                    4-Way 3.0GHz Machine (seconds)  X Speedup   PlayStation 3 (seconds)  X Speedup
dnapenny_orig                    81.345                          1           721.435                  1
dnapenny_slimmer                 34.388                          2.36550541  90.586                   7.9640894
parallel_dnapenny_1.0            27.997                          2.90548987  74.087                   9.7376733
supplement_spe_parallel_1SPE     N/A                             N/A         110.862                  6.5075048
supplement_spe_parallel_3SPE     N/A                             N/A         53.316                   13.531304
supplement_spe_parallel_6SPE     N/A                             N/A         39.491                   18.26834
supplement_parallel_vector_1SPE  N/A                             N/A         29.543                   24.419829
supplement_parallel_vector_3SPE  N/A                             N/A         20.707                   34.840151
supplement_parallel_vector_6SPE  N/A                             N/A         19.648                   36.717987

Table 10: in_a15x822.txt

Code revision                    4-Way 3.0GHz Machine (seconds)  X Speedup   PlayStation 3 (seconds)  X Speedup
dnapenny_orig                    23.187                          1           205.518                  1
dnapenny_slimmer                 9.582                           2.41984972  25.253                   8.1383598
parallel_dnapenny_1.0            9.149                           2.53437534  22.25                    9.236764
supplement_spe_parallel_1SPE     N/A                             N/A         31.736                   6.4758634
supplement_spe_parallel_3SPE     N/A                             N/A         16.118                   12.750838
supplement_spe_parallel_6SPE     N/A                             N/A         13.664                   15.040837
supplement_parallel_vector_1SPE  N/A                             N/A         10.272                   20.007593
supplement_parallel_vector_3SPE  N/A                             N/A         7.871                    26.110786
supplement_parallel_vector_6SPE  N/A                             N/A         8.312                    24.725457
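The speedup columns in the tables above are the ratio of the dnapenny_orig run time to the run time of each revision on the same machine. As a worked example, the short Python sketch below reproduces two figures from the PlayStation 3 timings in Table 5; the variable names are illustrative, and the 1-to-6 SPE scaling ratio is derived here from the table rather than reported in it.

```python
def speedup(baseline_seconds, revised_seconds):
    """Speedup of a revision relative to a baseline run time."""
    return baseline_seconds / revised_seconds


# PlayStation 3 timings for infile.orig, copied from Table 5.
ps3_orig = 7793.915
ps3_vector_6spe = 130.59
ps3_parallel_1spe = 1111.471
ps3_parallel_6spe = 277.233

overall = speedup(ps3_orig, ps3_vector_6spe)             # vs. the original code
scaling = speedup(ps3_parallel_1spe, ps3_parallel_6spe)  # 1 SPE vs. 6 SPEs

print(f"{overall:.2f}x overall, {scaling:.2f}x from 1 to 6 SPEs")
```

For this input, going from 1 to 6 SPEs gives roughly a fourfold gain rather than sixfold, which is consistent with the fixed overheads of context creation and data transfer described earlier.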

The graphs below present the performance results above.

Figure 6 - infile.orig

Figure 7 - in_b17x1237.txt

Figure 8 - in_b12x1236.txt

Figure 9 - in_b10x1219.txt

Figure 10 - in_a18x822.txt

Figure 11 - in_a15x822.txt

Article VI. Concluding Material

This section includes the team’s total hours spent on the project, the earned value analysis, outstanding issues, future work, and the conclusion.

3.6 Total Hours

The table below shows the hours spent on each task by each team member.

Table 11: Total Hours

Team Member         Task 1  Task 2  Task 3  Task 4  Task 5  Task 6  Task 7  Task 8  Totals
Kyle Byerly         24      11      4       58      35.5    10      8       19      169.5
Shannon McCormick   26      8       3       63      14      9       9       25      157
Matt Rohlf          23      9       5.5     54      13      11      8       34      157.5
Bryan Venteicher    27.5    9       5       97      16      12      10      21      197.5
Totals              100.5   37      17.5    272     78.5    42      35      99      681.5

3.7 Earned Value Analysis

The tables below show the earned value analysis for the project.

Task                                          Estimated Hours  Actual Hours  % Complete  Budgeted Costs of Work Performed  Budgeted Costs of Work Scheduled
Problem Definition                            100              100.5         100%        $1,000.00                         $1,000.00
Technology and Implementation Considerations  36               37            100%        $360.00                           $360.00
End-Product Design                            20               17.5          100%        $200.00                           $200.00
End-Product Prototype Implementation          320              272           100%        $3,200.00                         $3,200.00
End-Product Testing                           60               78.5          100%        $600.00                           $600.00
End-Product Documentation                     40               42            100%        $400.00                           $400.00
End-Product Demonstration                     48               35            100%        $480.00                           $480.00
Project Reporting                             140              99            100%        $1,400.00                         $1,400.00
Total                                         764              681.5         100%        $7,640.00                         $7,640.00

Table 12 - Earned Value Analysis

Task                                          Actual Costs of Work Performed  Cost Variance  Schedule Variance  Cost Performance Index  Schedule Performance Index
Problem Definition                            $1,005.00                       -$5.00         $0.00              99.5%                   100.00%
Technology and Implementation Considerations  $370.00                         -$10.00        $0.00              97.3%                   100.00%
End-Product Design                            $175.00                         $25.00         $0.00              114.3%                  100.00%
End-Product Prototype Implementation          $2,720.00                       $480.00        $0.00              117.6%                  100.00%
End-Product Testing                           $785.00                         -$185.00       $0.00              76.4%                   100.00%
End-Product Documentation                     $420.00                         -$20.00        $0.00              95.2%                   100.00%
End-Product Demonstration                     $350.00                         $130.00        $0.00              137.1%                  100.00%
Project Reporting                             $990.00                         $410.00        $0.00              141.4%                  100.00%
Total                                         $6,815.00                       $825.00        $0.00              112.1%                  100.00%

Table 13 - Earned Value Analysis
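The variance and index columns follow from the standard earned value definitions: cost variance is the budgeted cost of work performed minus the actual cost, schedule variance is the budgeted cost of work performed minus the budgeted cost of work scheduled, and each index is the corresponding ratio. The Python sketch below reproduces the End-Product Testing row. The function name is hypothetical, and the $10 per hour rate is inferred from the tables (100 estimated hours budgeted at $1,000.00) rather than stated in the text.

```python
def earned_value(budgeted_hours, actual_hours, percent_complete=1.0, rate=10.0):
    """Compute earned value figures for one task.

    `rate` is dollars per hour; the default of 10.0 is inferred from the tables.
    """
    bcws = budgeted_hours * rate        # budgeted cost of work scheduled
    bcwp = bcws * percent_complete      # budgeted cost of work performed
    acwp = actual_hours * rate          # actual cost of work performed
    return {
        "cost_variance": bcwp - acwp,
        "schedule_variance": bcwp - bcws,
        "cpi": bcwp / acwp,
        "spi": bcwp / bcws,
    }


testing = earned_value(budgeted_hours=60, actual_hours=78.5)
print(testing["cost_variance"], round(testing["cpi"] * 100, 1))  # -185.0 76.4
```

A cost performance index below 100% means the task consumed more hours than budgeted, as testing did; the indices above 100% for prototype implementation and reporting reflect tasks that finished under their estimates.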

3.8 Outstanding Issues

The team is not aware of any outstanding issues regarding either the parallelized version or the Cell/B.E. port of DNAPenny.

3.9 Future Work

There are additional applications in the BioPerf suite that could be ported to the Cell/B.E. The team did not consider some applications because they had too many lines of code. The manual vectorization of the DNAPenny port could also be refined further. While the team vectorized the most significant portions of the code, other portions could also benefit from vectorization.

3.10 Conclusion

The team successfully ported the DNAPenny application from the BioPerf suite to the Cell/B.E. The measured performance improvement is significant: up to a 60-fold reduction in the runtime of the application. Furthermore, the team members gained valuable knowledge of bioinformatics and of programming on the Cell/B.E.

Article VII. References

M.I.T. 6.189 Class website: http://cag.csail.mit.edu/ps3

R. Narayanan, B. Özışıkyılmaz, J. Zambreno, G. Memik, and A. Choudhary, MineBench: A Benchmark Suite for Data Mining Workloads, 2006 IEEE International Symposium on Workload Characterization, pages 83-93, San Jose, CA, October 2006.

Sachdeva, Vipin; Kistler, Michael; Speight, Evan; Tzeng, Tzy-Hwa Kathy. 2007. “Exploring the Viability of the Cell Broadband Engine for Bioinformatics Applications.” IBM.

J. Zambreno, B. Özışıkyılmaz, G. Memik, A. Choudhary, and J. Pisharath, Performance Characterization of Data Mining Applications using MineBench, 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-9), Austin, TX, February 2006.

Felsenstein, Joseph. “DNAPENNY - Branch and bound to find all most parsimonious trees for nucleic acid sequence parsimony criteria”. The University of Washington. 2000.

Desaraju, Ramarao. “A Parallel Implementation of a Parsimony-based method for Phylogenetic Inference.” University of Colorado at Colorado Springs. May 2005.

National Center for Biotechnology Information GenBank website: http://www.ncbi.nlm.nih.gov/Genbank
