
Jennifer Moore

Pipeline

Pipelining is a technique used in the Xeon Phi that breaks the fetch and execute cycle into several steps. Today's processors are able to run faster by utilizing pipelining, which makes processing multiple instructions simultaneously possible. Pipelining can be described as the basic instruction path through the design of any CPU. Shantanu Dutt explains it as a "concept in which the entire processing flow is broken up into multiple stages, and a new data/instruction is processed by a stage potentially as soon as it is done with the current data/instruction, which then goes onto the next stage for further processing" (Dutt, 2001). It is very similar to the Very Long Instruction Word approach in its use of parallelism: different steps work together in parallel, so more than one instruction can be in progress at the same time.

These stages describe the different steps needed to perform a fetch and execute cycle. Each stage can be compared to one of the instructions given to the Little Man Computer so that it can execute an operation. Each stage is timed, and a delay can result depending on the outcome of each step (Englander, 2009). Pipelining is similar to the Little Man Computer in this respect; however, the LMC uses operation codes to determine the outcome for a number input by the user, whereas pipelining overlaps the stages of many instructions on a timed basis.

[Figure omitted] (Englander, 2009).
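The speedup from overlapping stages can be sketched with a little cycle-count arithmetic. This is an illustrative sketch, not Xeon Phi code; it assumes an idealized in-order pipeline where one stage completes per clock cycle and there are no stalls.

```python
# Illustrative sketch: cycle counts for a simple in-order pipeline,
# assuming one stage completes per clock cycle and no stalls occur.

def cycles_unpipelined(n_instructions, n_stages):
    # Each instruction passes through every stage before the next starts.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    # After the pipeline fills (n_stages cycles), one instruction
    # completes every subsequent cycle.
    return n_stages + (n_instructions - 1)

unpipelined = cycles_unpipelined(100, 5)  # 500 cycles
pipelined = cycles_pipelined(100, 5)      # 104 cycles
```

With 100 instructions and 5 stages, pipelining cuts the cycle count from 500 to 104, which is why overlapping the fetch and execute steps matters.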

Disk Cache

A disk cache is made up of main memory or integrated memory within most newer disk drives. The disk cache makes it possible to access information from the disk faster by storing frequently used data in temporary memory so that it is promptly accessible. Englander explains: "when a disk read or write request is made, the system checks the disk cache first. If the required data is present, no disk access is necessary; otherwise, a disk cache line made up of several adjoining disk blocks is moved from the disk into the disk cache area of memory" (Englander, 2009). Caching thus allows the system to temporarily store commonly used data where it can be quickly retrieved without accessing the disk.

The diagram below shows how data is accessed from the disk cache. If the server finds the requested data in the disk cache, it does not have to access the disk. When the data is accepted and stored, as the Vickovic, Celar, and Mudnic article explains, "when the request is stored, the amount of free space on Disk Cache is decreased and it is pushed on cache queue" (Vickovic, Celar & Mudnic, 2011). Data held in the disk cache can be delivered to the Xeon Phi processor faster than data read directly from the drive itself.

[Figure omitted; similar to the one in Vickovic, Celar & Mudnic, 2011.]
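Englander's check-the-cache-first behavior can be sketched in a few lines. This is a hypothetical model, not drive firmware; the line size of four adjoining blocks is an assumption for illustration.

```python
# Hypothetical sketch of the disk-cache check described above: on each
# request the cache is consulted first; on a miss, a whole line of
# adjoining disk blocks is copied into the cache.

CACHE_LINE = 4  # number of adjoining disk blocks per cache line (assumed)

def read_block(block, cache, disk):
    line_start = (block // CACHE_LINE) * CACHE_LINE
    if line_start in cache:
        # Hit: no disk access is necessary.
        return cache[line_start][block - line_start], "hit"
    # Miss: move a line of adjoining blocks from disk into the cache.
    cache[line_start] = [disk[b] for b in range(line_start, line_start + CACHE_LINE)]
    return cache[line_start][block - line_start], "miss"

disk = list(range(16))   # stand-in for disk block contents
cache = {}
first = read_block(5, cache, disk)   # loads blocks 4-7 -> miss
second = read_block(6, cache, disk)  # already in the line -> hit
```

Reading block 5 misses and pulls in blocks 4–7; the follow-up read of block 6 is then served from the cache without touching the disk.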

Very Long Instruction Word (VLIW)

A very long instruction word is an architectural technique used in processors like the Xeon Phi that lets programs execute more efficiently. According to the Englander text, the main purpose of this architecture is "to increase execution speed by processing instruction operations in parallel" (Englander, 2009). A user working on a high-speed program would rely on VLIW. A VLIW processor consists of numerous execution units that help the program run faster. Binu Mathew describes VLIW as "one particular style of processor design that tries to achieve high levels of parallelism by executing long instruction words composed of multiple operations" (Philips, 2008). A CPU that must run programs quickly and efficiently benefits from a Very Long Instruction Word design.

VLIW can be illustrated by the Transmeta Crusoe, a VLIW processor design. The Transmeta Crusoe uses a distinctive instruction format. Englander explains it as "a 128-bit instruction word called a molecule. The molecule is divided into four 32-bit atoms. Each atom represents an operation similar to those of a normal 32-bit instruction word" (Englander, 2009). The diagram below demonstrates the 128-bit instruction word Englander describes. Compared to the LMC, both perform a fetch and execute cycle: each can add, load, branch on condition, and store numbers. The four atoms supply the four operations in the instruction word and collaborate to complete the execution cycle. By using parallelism, multiple operations proceed simultaneously.

[Figure omitted] (Englander, 2009).
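The molecule/atom split described above is just bit slicing, which can be sketched directly. The atom values below are arbitrary examples, not real Crusoe opcodes.

```python
# Sketch of the Crusoe "molecule": a 128-bit instruction word divided
# into four 32-bit atoms (the values here are arbitrary examples).

def split_molecule(molecule):
    # Extract four 32-bit atoms from a 128-bit integer, high atom first.
    return [(molecule >> shift) & 0xFFFFFFFF for shift in (96, 64, 32, 0)]

molecule = (0x11111111 << 96) | (0x22222222 << 64) | (0x33333333 << 32) | 0x44444444
atoms = split_molecule(molecule)  # four 32-bit operations
```

Each of the four extracted atoms would then be dispatched to its own execution unit, which is where the parallelism of the design comes from.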

References

Pipeline

Englander, I. (2009). The architecture of computer hardware, systems software, & networking. (4th ed., p. 253). Wiley.

Dutt, S. (2001). Pipeline basics-lecture notes #14 Retrieved from http://www.ece.uic.edu/~dutt/courses/ece366/lect-notes.html

Disk Cache

Vickovic, L., Celar, S., & Mudnic, E. (2011). Disk array simulation model development. Retrieved from http://ehis.ebscohost.com.proxygsu-sct1.galileo.usg.edu/eds/pdfviewer/pdfviewer?sid=1eb5e2fd-ad53-4fad-8dc2-8357a74e92b8@sessionmgr14&vid=6&hid=101

Englander, I. (2009). The architecture of computer hardware, systems software, & networking. (4th ed., p. 263). Wiley.

Very Long Instruction Word

Englander, I. (2009). The architecture of computer hardware, systems software, & networking. (4th ed., p. 244). Wiley.

Philips. (2008). An introduction to very-long instruction word (VLIW). Retrieved from http://twins.ee.nctu.edu.tw/courses/ca_08/literature/11_vliw.pdf

IT5200 Kornchai Anujhun

Ring Bus

A ring bus is a substation switching arrangement that may consist of four, six, or more breakers connected in a closed loop, with the same number of connection points. Figure 1 depicts the layout of a ring bus configuration, which is an extension of the sectionalized bus. In the ring bus, a sectionalizing breaker has been added between the two open bus ends. In other words, there is a closed loop on the bus with each section separated by a circuit breaker. This provides greater reliability and allows for flexible operation.

Figure 1 Ring bus

Figure 2 4-Breaker Ring Bus in ATI Graphics Card


USB

Universal Serial Bus, also known as USB, is a standard type of connection for many different kinds of devices. Generally, USB refers to the types of cables and connectors used to connect these many types of external devices to computers. The Universal Serial Bus standard has been extremely successful. USB ports and cables are used to connect hardware such as printers, scanners, keyboards, mice, flash drives, external hard drives, joysticks, and cameras to computers of all kinds, including desktops, laptops, and tablets. In fact, USB has become so common that you'll find the connection available on nearly any computer-like device, such as video game consoles, home audio/visual equipment, and even many automobiles. Many portable devices, like smartphones, eBook readers, and small tablets, use USB primarily for charging. USB charging has become so common that it's now easy to find replacement electrical outlets at home improvement stores with USB ports built in, negating the need for a USB power adapter.

Figure 3 USB Connection


Memory Interleaving

Memory interleaving is a method to increase the speed of high-end computer systems. It is a memory access technique that divides system memory into a series of equal-sized banks. These banks are described as n-way interleaved: 2-way interleaving uses two complete address buses, 4-way interleaving uses four, and 8-way interleaving uses eight. While one section is busy processing a word at a particular location, another section accesses the word at the next location.

Figure 4 2-way Interleaved Memory

In a 2-way interleaved memory system there are two physical banks of DRAM, but logically the system sees one bank of memory that is twice as large. In the interleaved bank, the first long word of bank 0 is followed by the first long word of bank 1, which is followed by the second long word of bank 0, which is followed by the second long word of bank 1, and so on. Figure 4 shows this organization for two physical banks of N long words each. All even long words of the logical bank are located in physical bank 0 and all odd long words are located in physical bank 1.
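The even/odd mapping above is just modular arithmetic on the word index. A minimal sketch, assuming word-granularity interleaving:

```python
# Sketch of n-way interleaved addressing: consecutive long words map to
# alternating physical banks, so bank k holds words with index % n == k.

def bank_of(word_index, n_banks):
    return word_index % n_banks

def offset_in_bank(word_index, n_banks):
    return word_index // n_banks

# 2-way interleaving: even words live in bank 0, odd words in bank 1.
two_way = [(bank_of(i, 2), offset_in_bank(i, 2)) for i in range(4)]
```

For word indices 0–3 this yields (bank, offset) pairs (0,0), (1,0), (0,1), (1,1), matching the alternation described in the paragraph.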


References

Schuette, M. (2011, Jan 02). Intel's Architecture & CPU Performance. One Ring Bus to Master Them All. Retrieved from http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=98&Itemid=1&limit=1&limitstart=6

Wikipedia. (2011, Dec). Network Topology. Retrieved from http://en.wikipedia.org/wiki/Network_topology

Shimpi, A. (2010, Sep 14). Intel’s Sandy Bridge Architecture Exposed. Retrieved from http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/4

Wikipedia. (2012, Dec). Universal Serial Bus. Retrieved from http://en.wikipedia.org/wiki/Universal_Serial_Bus

PCMag. USB Definition. Retrieved from http://www.pcmag.com/encyclopedia_term/0,2542,t%3DUSB&i%3D53531,00.asp

Matloff, N. (2003, Nov 05). Memory Interleaving. Retrieved from http://heather.cs.ucdavis.edu/~matloff/154A/PLN/Interleaving.pdf

ORNL Physics Division. Interleaved Memory. Retrieved from http://www.phy.ornl.gov/csep/ca/node19.html


An execution unit, also called a functional unit, is a part of the CPU that performs the operations and calculations called for by the branch unit, which receives data from the CPU. It may have its own internal control sequence unit, some registers, and other internal units such as a sub-ALU. It is commonplace for modern CPUs to have multiple execution units, a design referred to as superscalar. The simplest arrangement is to use one unit, the bus manager, to manage the interface and the others to perform calculations. Modern CPU execution units are usually pipelined. To execute an instruction, the execution unit must first fetch the object code byte from the instruction queue and then execute the instruction. If the queue is empty when the execution unit is ready to fetch an instruction byte, the execution unit waits for the bus interface unit to fetch the instruction byte.

In the coprocessor, the instructions operate on data read from the coprocessor load data queue. Data is written back, for example to memory or a register, by inserting the data into the out-of-order execution pipeline, either directly or via the coprocessor store data queue, which writes back the data.

Ja’afar Bilal

IT5200: Definitions Assignment

Class: Tuesday & Thursday

Due: February 17, 2013

Multi-core Processor: A computer chip that contains two or more cores within the same framework; a core can also be referred to as a central processing unit (CPU), which reads and executes instructions from programs.

Within any computer system, a processor is considered to be the "brain" of the computer. Figure 7.1 from the textbook highlights four main components within the CPU: the arithmetic logic unit (ALU), the control unit (CU), registers, and an input/output (I/O) interface.1 The ALU performs arithmetic and logic operations, calculating on and manipulating bits represented in binary code. The CU controls the instruction and data flow within the CPU. The memory within the CPU consists of registers (not shown in Figure 7.1), which store binary code used for computation. Finally, the I/O interface allows for the transfer of data to peripheral devices.

The processor controls the computer by fetching and executing program instructions. In brief, to execute an instruction, the Program Counter (PC), located within the CU, sends the address of the instruction to the Memory Address Register (MAR). The MAR places that address on the address bus to the RAM. The RAM returns the instruction on the data bus to the Memory Data Register (MDR). The instruction held in the MDR is copied to the Instruction Register (IR). The instruction is then decoded and executed, and the PC advances to the next instruction.
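The register transfers above can be sketched as a tiny fetch step. This is a minimal illustration, not a full simulator; the opcodes are arbitrary LMC-style examples.

```python
# Minimal sketch of the fetch described above:
# PC -> MAR -> RAM -> MDR -> IR, then the PC advances.

def fetch(ram, pc):
    mar = pc           # PC sends the instruction address to the MAR
    mdr = ram[mar]     # RAM returns the instruction over the data bus
    ir = mdr           # the instruction is copied into the IR
    return ir, pc + 1  # PC advances to the next instruction

ram = [510, 511, 902]  # arbitrary LMC-style opcodes, for illustration only
ir, pc = fetch(ram, 0)
```

After one fetch, the IR holds the first opcode and the PC points at the next instruction, ready for decode and execute.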

In a multi-core processor, each core has the same architecture. Each core has access to the same input and output interface, programs, and memory.1 Processors come in a range of configurations: dual-core (2), quad-core (4), hexa-core (6), and octa-core (8). See Figure 1.1 below. With multi-core processors the workload is distributed among the cores, improving the overall speed and efficiency of the computer system; it also allows the computer to run multiple programs with ease. For example, a computer with a quad-core processor could simultaneously run programs such as uTorrent and Word while the anti-virus system runs in the background.

Technology companies such as Intel produce multi-core processors that contain 50 or more cores within its framework; the Xeon Phi Knights Corner coprocessor is such a device. Large businesses and companies such as CNN or Google may use Knights Corner to manage large amounts of information and data within their server center.

1. The Architecture of Computer Hardware, Systems Software, & Networking. Englander, Irv. Pg: 265.

Direct Memory Access (DMA): Allows devices to engage in block data transfer by directly accessing the computer's main memory, bypassing the CPU. It is important to know that DMA is the third method of data transfer; the first is programmed I/O and the second is interrupt-driven I/O.

The Xeon Phi Knights Corner coprocessor allows for DMA. DMA is designed to ease the workload of the CPU during data transfer. This results in faster CPU processing and allows for an easier method of accessing the main memory.

The only time the CPU is involved in DMA is when it initiates and ends the data transfer. During DMA, the DMA controller takes command until the CPU is interrupted when the data transfer is complete.1 The DMA controller acts as the I/O module because it is an interface between the CPU and the I/O device.

These three conditions must be met for DMA to occur2:

1. The I/O device and the main memory must have a line of communication, which will be a bus.
2. The DMA controller must be able to read and write to the main memory.
3. The DMA controller and CPU must avoid conflict.

In order for the DMA controller to control the data transfer, it must know2:

1. The location of the data on the I/O device.
2. The location of the data in main memory.
3. The size of the data transfer.
4. The direction of the transfer: from the I/O device to memory, or from memory to the I/O device.
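The four items the controller must know map directly onto the parameters of a block copy. This is a hypothetical sketch, not Intel's DMA engine; device and memory are modeled as plain lists.

```python
# Hypothetical sketch of a DMA block transfer: the CPU only supplies the
# device address, memory address, byte count, and direction; the
# controller then moves the whole block without CPU involvement.

def dma_transfer(device, memory, dev_addr, mem_addr, count, to_memory=True):
    if to_memory:
        memory[mem_addr:mem_addr + count] = device[dev_addr:dev_addr + count]
    else:
        device[dev_addr:dev_addr + count] = memory[mem_addr:mem_addr + count]
    return count  # the controller interrupts the CPU when done

device = [1, 2, 3, 4]
memory = [0] * 8
moved = dma_transfer(device, memory, 0, 4, 4)  # device block -> memory[4:8]
```

The CPU sets up the call and is free until the transfer finishes, mirroring the initiate/interrupt pattern described above.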

1. “Direct Memory Access (DMA),” www.ece.ubc.ca/~edc/379.jan99/lectures/lec13.pdf, Date Accessed: February 13, 2013

2. The Architecture of Computer Hardware, Systems Software, & Networking. Englander, Irv. Pg: 298-299.

Control Line: A type of conductor associated with buses.

The Xeon Phi Knights Corner coprocessor is connected to other parts of the computer system (i.e., main memory, I/O modules, etc.) through what is known as a bus. Buses act as a "data highway" making data transfer possible. Within a bus there are several conductors that carry electrical signals; the signals represent bits of data. The control line is one of these conductors; the three other types are the address, data, and power lines. Control lines within a bus carry the read and write commands for a data transfer and specify the number of bytes to be transferred. A control line is necessary because many different components communicate over a bus.1

1. The Architecture of Computer Hardware, Systems Software, & Networking. Englander, Irv. Pg: 214-215.

Dean Griffiths IT5200 – Intro Platforms & OS Xeon Phi Definition

Mask register

The interrupt mask register, or mask register, is a read/write register. Within the Xeon Phi coprocessor it enables or masks interrupts from being triggered on the external pins of the cache controller and selects the appropriate address depending on which interrupt controller is to be used. It is an eight-bit register that lets you individually enable and disable interrupts from devices on the system. Writing a zero to the corresponding bit enables that device's interrupts; writing a one disables interrupts from the affected device.
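The zero-enables/one-masks convention above is simple bit manipulation. A minimal sketch, assuming an eight-bit register where bit n corresponds to interrupt line n:

```python
# Sketch of the 8-bit mask-register convention described above:
# a 0 bit enables a device's interrupts, a 1 bit masks (disables) them.

def disable_irq(mask, line):
    return mask | (1 << line)          # set the bit -> interrupt masked

def enable_irq(mask, line):
    return mask & ~(1 << line) & 0xFF  # clear the bit -> interrupt enabled

def is_enabled(mask, line):
    return (mask >> line) & 1 == 0

mask = 0xFF                # start with every interrupt masked
mask = enable_irq(mask, 3) # unmask only line 3 -> 0xF7
```

Starting from 0xFF (all masked) and clearing bit 3 yields 0xF7, leaving only that device's interrupts enabled.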

Instruction set architecture (ISA)

The ISA, also called the CPU architecture, is the part of the computer architecture that forms a well-designed hardware/software interface. Its characteristics include the number and types of registers, the methods of addressing memory, and the basic design and layout of the instruction set. It describes the operations, modes, and storage locations supported by the hardware, plus how to invoke and access them. The ISA includes a specification of the machine language commands implemented by a particular processor (such as the Xeon Phi). Because the ISA interconnects hardware and software, it is part of the Application Binary Interface (ABI), which provides a program with access to the hardware resources and services available in a system.

Scalar processing represents a class of computer processors that process one data item at a time, typically an integer or floating-point number; this is also classified as "single instruction stream, single data stream" (SISD).

The diagram shows the difference between scalar and superscalar processing. Where does this fit in the Xeon Phi? With scalar execution, a single execution unit is used; provided that different branch conditions are handled, the CPU can average an instruction execution rate approximately equal to the clock speed of the machine. By comparison, with multiple execution units (superscalar) it is possible to process instructions at an average rate of more than one instruction per clock cycle.

References:

The Art of Assembly Language Programming. (2006). Interrupts, Traps, and Exceptions (Part 3): http://www.oopweb.com/Assembly/Documents/ArtOfAssembly/Volume/Chapter_17/CH17-3.html

Irv Englander. (2009). The Architecture of Computer Hardware, System Software, and Networking. Fourth edition

Wikipedia. Scalar processor: http://en.wikipedia.org/wiki/Scalar_processor

loc-nguyen. Intel® Xeon Phi™ Coprocessor Developer's Quick Start Guide: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide

ARM. The Architecture for the Digital World. 3.3.10. Register 2, Interrupt Mask Register: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0329l/Beiihfcc.html

Allan Snavely. Instruction Set Architecture: http://www.sdsc.edu/~allans/cs141/L2.ISA.pdf

Wikipedia. Instruction set: http://en.wikipedia.org/wiki/Instruction_set

Luke Varner February 17, 2012 Xeon Phi Terminology

Cluster Environment

A cluster takes the classic definition of the word and applies it to computers. Simply put by the book, a cluster refers to a collection of "loosely coupled computers configured to work together as a unit" (Englander, 2009). The concept of a cluster environment encourages combining computing power to distribute a workload across multiple devices. The individual computers within the cluster are referred to as nodes and can be added almost indefinitely to increase overall processing power. Cluster computing is primarily "used for computation-intensive purposes, rather than handling I/O-oriented operations such as web service or databases" ("Computer cluster," n.d.). In the context of the Xeon Phi, clustering is intended for supercomputing, where grouping individual Xeon Phi coprocessors together achieves superior processing power while reducing power consumption and energy costs. The latest commercially available version of the Xeon Phi has 60 cores. Using just 10 individual Xeon Phi nodes together in a cluster environment can increase processing power tremendously while taking up minimal space: with just ten cluster units you would be harnessing the power of 600 cores! In fact, you could describe the Xeon Phi processor as a cluster within a cluster due to the number of cores it possesses. In the figure below we can see how the Xeon Phi could be used in a cluster environment: Figure A shows a Xeon Phi processor on its own, and Figure B shows how it could be combined with other processors.

Xeon Phi in a Cluster Environment

[Intel Phi Inner] Retrieved from: http://www.amax.com/hpc/images/intel_phi_inner.jpg

Hit Ratio

The hit ratio, also commonly known as the hit rate, is used to measure the cache within a CPU. The hit ratio is the ratio of the number of records found in the cache when executing a particular process to the total number of records requested ("Hit rate"). A high hit ratio (>90%) equates to better CPU cache performance, whereas a lower hit ratio can indicate problems with the system configuration. When a miss is recorded it leads to a stall in the execution of instructions, which produces visible effects on performance. The Xeon Phi product description of cache performance mentions features designed to reduce the number of misses, which effectively keeps the hit ratio high. The processor recognizes when a miss has occurred and then queries tag directories in order to locate the data and return it to the correct location. Because the Xeon Phi possesses a large number of individual cores, it can query all of the cores' caches to locate the missing data and so reduce the number of cache misses. These features reduce data redundancy and illustrate an intuitive piece of technology that can recognize and correct its own misses. In the figure below, part A demonstrates how the L2 caches are all interconnected so that requested data can be tracked on each core. Part B shows how the tag data is searched to find the relevant data before moving on to the next core cache. These features should result in a high hit ratio for each of the cores and the system as a whole.

Xeon Phi Cache and Hit Ratio Performance

[Distributed Tag Directories] Retrieved from: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor- codename-knights-corner
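The hit-ratio arithmetic above reduces to hits divided by total accesses:

```python
# Sketch of the hit-ratio calculation: hits divided by total accesses.

def hit_ratio(hits, misses):
    total = hits + misses
    return hits / total if total else 0.0

# e.g. 95 hits out of 100 accesses gives 0.95, a healthy >90% ratio
ratio = hit_ratio(95, 5)
```

A system reporting 95 hits and 5 misses has a hit ratio of 0.95, comfortably in the high-performance range the text describes.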

Write Through

Write-through is a storage process in which information is written to both main memory and the cache at the same time. This method allows for quick retrieval of information and ensures the integrity of the data against power outages or system downtime. On the other hand, due to the redundancy of writing to two locations before executing another step, a sacrifice in speed must be made (Gibilisco & Rouse, 2012). The write-through method emphasizes data consistency and integrity over speed. The system developer's guide for the Xeon Phi states that only the uncacheable and write-back methods can be used and that "the other three memory forms [write-through, write-combining, and write-protect] are mapped internally to behavior" (Intel Corp, 2012). In other words, the latter three memory forms are handled automatically by each of the 60 individual cores in a Xeon Phi coprocessor and cannot be altered or changed. In the diagram below, part A could represent one of the cores in the Xeon Phi, showing the CPU and its cache. Using the write-through method, the data is written to the cache and also to the RAM, as illustrated in part B.

Write Through in a Xeon Phi Core

[Write-through cache] Retrieved from: http://www.brainbell.com/tutors/A+/Hardware/Cache_Memory.htm
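The write-to-both-places behavior can be sketched in a couple of lines. Cache and memory are modeled as plain dictionaries for illustration:

```python
# Sketch of write-through: every store updates the cache and main
# memory together, so memory never holds stale data (at a speed cost).

def write_through(addr, value, cache, memory):
    cache[addr] = value    # update the cache copy...
    memory[addr] = value   # ...and main memory at the same time

cache, memory = {}, {}
write_through(0x10, 42, cache, memory)
```

After the store, cache and memory agree, which is exactly the consistency guarantee (and the double-write cost) described above.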

Sources

Cache memory. (n.d.). Retrieved from http://www.brainbell.com/tutors/A+/Hardware/Cache_Memory.htm

Chrysos, G. (n.d.). Intel® xeon phi™ coprocessor (codename knights corner). Retrieved from http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner

Computer cluster. (n.d.). Retrieved from http://en.wikipedia.org/wiki/Computer_cluster

Englander, I. (2009). The architecture of computer hardware, systems software, & networking. (4th ed.). Hoboken, NJ: John Wiley & Sons Inc.

Gibilisco, S., & Rouse, M. (2012, July). Write through. Retrieved from http://whatis.techtarget.com/definition/write-through

Hit rate. (n.d.). Retrieved from http://www.answers.com/topic/hit-rate

Intel Corp. (2012, November 8). Intel® xeon phi™ coprocessor system software developers guide. Retrieved from

Intel phi inner. (n.d.). Retrieved from http://www.amax.com/hpc/images/intel_phi_inner.jpg

IT 5200 Platforms & OS Lloyd Middlebrooks Xeon Phi Definitions Assignment February 4, 2013

Xeon Phi Definitions

Vector Processing Unit (VPU): The VPU executes instructions that take the form of vectors. Vectors are one-dimensional arrays of data. In a vector processor, a single instruction operates simultaneously on multiple data items (arrays), whereas a scalar processor processes a single data element (an integer or floating-point number) at a time. VPUs perform the essential numerical calculations associated with high-performance computing (HPC).

The Xeon Phi coprocessor works alongside the Xeon processor. The VPU is located inside each core of the coprocessor, as shown in Figure 1. The Xeon Phi's VPU features a 512-bit SIMD (Single Instruction, Multiple Data) instruction set and can execute 8 double-precision or 16 single-precision operations per cycle in its arithmetic logic unit (ALU), as shown in Figure 2. The VPU benefits the processor by reducing the number of fetch and decode operations, which would otherwise incur higher energy costs and consume bandwidth.

software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 1

software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 2
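The 8-lane/16-lane figures quoted for the VPU follow directly from dividing the register width by the element width:

```python
# Sketch of the lane arithmetic: a 512-bit register holds
# 512/64 = 8 double-precision or 512/32 = 16 single-precision lanes.

def lanes(register_bits, element_bits):
    return register_bits // element_bits

dp_lanes = lanes(512, 64)  # double precision -> 8 lanes
sp_lanes = lanes(512, 32)  # single precision -> 16 lanes
```

One 512-bit instruction therefore does the work of 8 or 16 scalar operations per cycle, which is the parallelism the VPU exploits.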

Clock: The internal clock refers to an internal signal that operates at a constant frequency in a processor and regulates the rate at which instructions are executed. Essentially, the clock controls when each step in the fetch-execute cycle takes place and synchronizes the various computer components. Clock speed is usually measured in megahertz (MHz) and gigahertz (GHz).

The Xeon Phi has a clock speed of approximately 1.1 GHz, meaning that it can process 1.1 billion clock cycles per second. A unique characteristic of the Xeon Phi is that it has a gated clock. This means that when the Xeon processor's workload does not demand assistance from the Xeon Phi coprocessor (all four threads on a core are halted), the clock is gated, shown in red within Figure 3. Subsequently, the core is powered down after the programmed amount of time. This preserves power and ensures the system is operating efficiently.

software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 3

Thread: Englander (2009) defines a thread as an individually executable part of a process (an executable program). Although threads can be scheduled to run separately from other threads in the same process, they can also share memory and other resources with those associated threads. Each core of the Xeon Phi coprocessor can support four threads in hardware, which is referred to as multithreading. During multithreading, the coprocessor often switches between threads (context switching). The threads are passed over the bus highlighted between the red lines in Figure 4.

software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Figure 4

Sources:

Clock Speed. (2013). Retrieved February 17, 2013, from http://www.webopedia.com/TERM/C/clock_speed.html

Englander, Irv. (2009). The Architecture of Computer Hardware, Systems Software & Networking. Process Control Management (pp. 493). Hoboken: John Wiley & Sons, Inc.

Gilbert, H. (2004, December 22). Clock Speed: Tell Me When It Hurts. Retrieved February 17, 2013, from http://www.yale.edu/pclt/PCHW/clockidea.htm

Intel Xeon Phi Coprocessor (codename Knights Corner). Retrieved February 17, 2013, from http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights- corner

Pratyusa Manadhata, Vyas Sekar. (2013). Vector Processors. Carnegie Mellon University. Retrieved February 17, 2013, from http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/lectures/vector.pdf

Product and Performance Information. Retrieved February 17, 2013, from http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

Thread. (2013). Retrieved February 17, 2013, from http://en.wikipedia.org/wiki/Thread_(computing)

Vector Processor. (2013). Retrieved February 17, 2013, from http://en.wikipedia.org/wiki/Vector_processor

SIMD

Single Instruction, Multiple Data (SIMD) is one of the classifications defined by Flynn, based upon the number of concurrent instruction and data streams available in the architecture. A SIMD computer exploits multiple data streams against a single instruction stream to perform operations that may be naturally parallelized. An array processor or GPU is a good example. The first use of SIMD instructions was in vector supercomputers. Today, most commodity CPUs implement architectures that feature instructions for a form of vector processing on multiple data sets, typically known as SIMD.

The Xeon Phi coprocessor has 512-bit registers for SIMD operations, considered one of its key features. High-performance codes on the Xeon Phi coprocessor utilize these wide SIMD instructions to reach the desired performance level. The best performance is achieved when the cores, threads, and SIMD (vector) operations are used together effectively. The best way to take advantage of the 512-bit-wide SIMD instructions is to write code in an array notation style; when array notation is used, the compiler will utilize the SIMD instruction set.
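The array-notation idea is that expressing work on whole arrays lets a compiler map it onto wide SIMD lanes. Plain Python can only imitate the style, so the sketch below models one logical operation applied across every element at once; it does not generate real SIMD instructions.

```python
# Illustrative sketch of the array-notation style: one logical add is
# applied across all elements (lanes) at once, the pattern a vectorizing
# compiler would map onto a single wide SIMD instruction.

def simd_add(a, b):
    return [x + y for x, y in zip(a, b)]

result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
```

Writing the whole-array operation in one expression, rather than an index-by-index loop with data-dependent control flow, is what gives the compiler the freedom to vectorize.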

SMP

Symmetric multiprocessing (SMP) involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory (MM) and are controlled by a single operating system. Most common multiprocessor systems today use the SMP architecture. Usually each processor has an associated private high-speed memory, known as cache memory, to speed up MM data access and reduce system bus traffic. Processors may be interconnected using buses, crossbar switches, or an on-chip mesh network. SMP systems allow any processor to work on any task no matter where the data for that task is located in memory, provided each task is not in execution on more than one processor at the same time. With proper operating system support, SMP systems can easily move tasks between processors to balance the workload efficiently. Software has been developed to schedule jobs so that processor utilization reaches its maximum potential.

The Xeon Phi coprocessor runs Linux; it really is an SMP-on-a-chip running Linux, and every card has its own IP address. vSMP Foundation supports the Xeon Phi coprocessor. This revolutionary coprocessor is based on Intel architecture and is designed for highly parallel workloads. It is supported in two modes: processor-virtualized mode and coprocessor-aggregated mode.

Technology Definitions

GDDR Memory

GDDR5 memory is used primarily for graphics processing and is part of the circuitry on a graphics board (PCI Express). GDDR3 and GDDR4 are the two current standards in use and are differentiated by performance and price, with GDDR4 being the higher performer along with the higher price. This is not related in any way to the numbering of processor storage.

AMD has developed a new graphics memory chip named GDDR5, keeping in step with the last two offerings. GDDR5 has twice the speed of GDDR3 due to a wider bandwidth. It was originally developed by AMD for use with graphics processing, giving more speed and higher quality. As you can see from the screenshot below, Intel not only uses it for traditional graphics processing but has created a high-speed PCI ring interconnect to connect each coprocessor and each core directly.

References:
Intel - Intel® Xeon Phi™ Coprocessor
AMD Corp
ExtremeTech - GDDR5 Memory: Under the Hood

Branch History Table

A branch prediction circuit was developed by HP to try to predict which way an IF-THEN-ELSE branch will go prior to its execution. It fetches the instruction at its predicted target ahead of the actual location counter, and that instruction is "speculatively executed". If the choice turns out to be wrong, it is reversed and the correct instruction is put back in the pipeline. The results of these operations are kept in the Branch History Table, frequently called something similar but not necessarily the same. The larger and more populated the table, the better the hit ratio of predictions.

I did not find a reference to this in the Xeon Phi documentation, but Intel probably has something similar, as the article leads one to believe that this circuit is very commonly used.

I found another reference in an IBM publication which referred to the whole process as the Branch History Table; it is probably the best source to read if you have further interest.
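One common way such a table is built (the exact HP or Intel circuit is not public, so this is only an illustrative sketch) is with 2-bit saturating counters: a counter of 2 or 3 predicts "taken", and each actual outcome nudges the counter, so a single mispredicted loop exit does not flip the prediction.

```python
# Sketch of a branch history table built from 2-bit saturating counters.
# A counter >= 2 predicts "taken"; each real outcome nudges the counter.

class BranchHistoryTable:
    def __init__(self, size=1024):
        self.counters = [1] * size  # start in "weakly not taken"

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BranchHistoryTable()
hits = 0
# a typical loop branch: taken 8 times, one exit, taken 8 more times
outcomes = [True] * 8 + [False] + [True] * 8
for taken in outcomes:
    if bht.predict(0x40) == taken:
        hits += 1
    bht.update(0x40, taken)
print(f"hit ratio: {hits}/{len(outcomes)}")  # 15/17
```

Note how the single loop exit costs only one misprediction, which is exactly why this scheme is popular for loop-heavy code.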

Sources: Wikipedia

T. J. Watson Research Center: Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns

Write-Back This is a performance technique in which, when data is updated, only the most current copy in cache is updated rather than the copy in main memory; the modified cache line is written back to main memory later. It is used when the risk of losing current data does not outweigh the gains in performance.

This can be contrasted with the technique of "write-through", where the data is updated both in cache and in main memory. This is obviously safer but also significantly slower.
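The cost difference between the two policies can be shown with a small sketch (the class and counters below are illustrative, not any real controller's interface): both policies update the cache, but only write-through pays a main-memory write on every store, while write-back pays one write at eviction time.

```python
# Contrast write-back with write-through: both update the cache,
# but only write-through writes main memory on every store.

class Cache:
    def __init__(self, write_back):
        self.write_back = write_back
        self.lines = {}          # address -> (value, dirty flag)
        self.memory_writes = 0   # the cost write-back tries to avoid

    def write(self, addr, value):
        if self.write_back:
            self.lines[addr] = (value, True)   # mark dirty, defer memory
        else:
            self.lines[addr] = (value, False)
            self.memory_writes += 1            # write through to memory

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:
            self.memory_writes += 1            # flush the line only now

wb, wt = Cache(write_back=True), Cache(write_back=False)
for cache in (wb, wt):
    for v in range(10):
        cache.write(0x100, v)   # same line rewritten ten times
wb.evict(0x100)
print(wb.memory_writes, wt.memory_writes)  # 1 vs 10
```

Rewriting one line ten times costs write-back a single memory write, while write-through pays ten, which is the performance gap the definition above describes.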

Sources: Microsoft - Write Back, part of the Backup and Recovery glossary

Xeon Phi Definitions

Extended Math Unit (EMU): the EMU is a shared function unit that handles more complex math operations on behalf of several Execution Units (EUs). It provides faster implementations of the single-precision transcendental functions, such as reciprocal, reciprocal square root, base-2 logarithm, and base-2 exponential, using lookup tables, and it performs these operations inside the vector processing unit. Implemented in hardware, it can also achieve a high throughput of 1 or 2 cycles for other transcendental functions that can be derived from these elementary functions. In the Xeon Phi coprocessor it performs the same function as in other Intel parts: it allows operations to be done in vector fashion with high bandwidth, and it also calculates polynomial approximations of these functions. For example, the figure below shows its operation.

Function      Math       Throughput (cycles)
Reciprocal    1/x        1
RSqrt         1/√x       1
Exp2          2^x        2
Log2          log2(x)    1

Figure 1: A simple EMU Calculation

References
http://software.intel.com/en-us/articles/achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi
http://software.intel.com/en-us/articles/case-study-achieving-superior-performance-on-black-scholes-valuation-computing-using
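The lookup-table approach described above can be sketched in software. This is only an illustration of the principle, not the EMU's actual table or precision: a coarse reciprocal table seeds one Newton-Raphson refinement step, x' = x(2 - ax), which roughly doubles the number of correct bits.

```python
# Lookup-table reciprocal sketch: a coarse table seed for 1/a on
# a in [1, 2), refined by one Newton-Raphson step x' = x * (2 - a*x).
# TABLE_BITS is an illustrative assumption, not the EMU's real size.

TABLE_BITS = 8

def seed(a):
    # Index a small table by the leading fraction bits of a in [1, 2);
    # here the "table entry" is computed on the fly for brevity.
    i = int((a - 1.0) * (1 << TABLE_BITS))
    return 1.0 / (1.0 + (i + 0.5) / (1 << TABLE_BITS))

def reciprocal(a):
    x = seed(a)                 # coarse guess, ~8 good bits
    return x * (2.0 - a * x)    # one refinement, ~16 good bits

for a in (1.0, 1.5, 1.999):
    print(a, reciprocal(a), abs(reciprocal(a) - 1.0 / a))
```

Hardware does the same thing with a real ROM table and fused multiply-adds, which is how a 1-cycle throughput for 1/x becomes feasible.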

Instruction Unit: implements the basic instruction pipeline, fetching instructions from the memory subsystem, dispatching them to available execution units, and maintaining a state history to ensure that operations finish in order. It is also used in executing conditional branch and unconditional jump instructions. The instruction unit is part of the Control Unit, which in turn is part of the CPU, and it handles all the preparation of instructions for execution. It is responsible for organizing the program instructions that are to be fetched and executed in an appropriate order. The instruction unit performs these operations in both conventional Intel processors and the Xeon Phi coprocessor.

Figure 2: Shows how the Instruction Unit operates
This diagram was retrieved from http://openrisc.net/or1200-spec.html

References
http://openrisc.net/or1200-spec.html
http://en.wikipedia.org/wiki/Instruction_unit
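The "finish in order" bookkeeping mentioned above can be sketched as follows. This is a simplified model, not the real circuit: results may be produced out of order by the execution units, but a retirement queue holds each one until every earlier instruction has also finished.

```python
# Sketch of in-order completion: instructions finish executing in any
# order, but are retired (made architecturally visible) strictly in
# program order.

def retire_in_order(finish_events):
    """finish_events: instruction indices in the order their execution
    completes. Returns the order in which they are retired."""
    done = set()
    next_to_retire = 0
    retired = []
    for idx in finish_events:
        done.add(idx)
        # retire every consecutive instruction that is now finished
        while next_to_retire in done:
            retired.append(next_to_retire)
            next_to_retire += 1
    return retired

# instruction 2 finishes first, but 0 and 1 must retire before it
print(retire_in_order([2, 0, 1, 3]))  # [0, 1, 2, 3]
```

However the execution units interleave, the retirement order always matches program order, which is the guarantee the instruction unit's state history provides.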

Miss: this term is used with cache memory; the event is usually called a cache miss. A miss is a situation in which the requested data is not already present in cache memory, that is, the processor accesses a memory location that is not in the cache. When this happens, the processor must wait for the data to be fetched from the next cache level or from main memory before continuing its execution, which is why cache misses can directly influence an application's performance. In the Xeon Phi, a cache miss occurs when a core generates an address request on the address ring and then queries the tag directories; if no match is found in the tag directories, the core must generate another address request and query memory for the data. The diagram below shows miss rate versus cache size on the integer portion of SPEC CPU2000.

Figure 3: Miss rate versus cache size on the integer portion of SPEC CPU2000
This diagram was retrieved from http://en.wikipedia.org/wiki/CPU_cache#Cache_miss
References
http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/miss_ratio.html
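The hit/miss behaviour described above can be sketched with a tiny direct-mapped cache. The sizes here are deliberately small to force misses and are not the Xeon Phi's real parameters:

```python
# Minimal direct-mapped cache: a hit when the tag matches, a miss
# (and a simulated fetch from the next level) when it does not.

LINE_SIZE = 64   # bytes per cache line
NUM_LINES = 4    # deliberately tiny so conflicts occur

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES
        self.hits = self.misses = 0

    def access(self, address):
        line = address // LINE_SIZE     # which memory line is wanted
        index = line % NUM_LINES        # which cache slot it maps to
        if self.tags[index] == line:
            self.hits += 1              # data already cached
        else:
            self.misses += 1            # fetch line from next level
            self.tags[index] = line     # ...and install it

cache = DirectMappedCache()
for addr in [0, 8, 64, 72, 0, 512, 0]:
    cache.access(addr)
print(cache.hits, cache.misses)  # 3 4
```

Note the last access to address 0 misses even though it was cached earlier: address 512 maps to the same slot and evicted it, the conflict behaviour that makes larger caches (as in Figure 3) lower the miss rate.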