
TECHNOLOGY ROADMAP DOCUMENT FOR SKA SIGNAL PROCESSING

Document number ...... WP2‐040.030.011‐TD‐001
Revision ...... 1
Author ...... W. Turner
Date ...... 2011‐02‐27
Status ...... Issued

Name    Designation    Affiliation    Date    Signature

Additional Authors:

Submitted by:

W. Turner    Signal Processing Domain Specialist    SPDO    2011‐03‐26

Approved by:

P. Dewdney Project Engineer SPDO 2010‐03‐29


DOCUMENT HISTORY

Revision    Date Of Issue    Engineering Change Number    Comments

1 ‐ ‐ First issue

DOCUMENT SOFTWARE

Package Version Filename

Wordprocessor MsWord Word 2007 02‐WP2‐040.030.011.TD‐001‐1_SKATechnologyRoadmap

Block diagrams

Other

ORGANISATION DETAILS

Name ...... SKA Program Development Office
Physical/Postal Address ...... Jodrell Bank Centre for Astrophysics, Alan Turing Building, The University of Manchester, Oxford Road, Manchester, UK, M13 9PL
Fax ...... +44 (0)161 275 4049
Website ...... www.skatelescope.org


TABLE OF CONTENTS

1 INTRODUCTION ...... 9
1.1 Purpose of the document ...... 9
1.2 Technology Readiness Levels ...... 10

2 REFERENCES ...... 12

3 PROCESSING ...... 14
3.1 General Purpose Processor ...... 14
3.1.1 Theoretical Processing Performance ...... 17
3.1.2 Cost ...... 17
3.1.3 Thermal Dissipation ...... 17
3.1.4 Scalability ...... 18
3.2 Graphics Processing Unit ...... 19
3.2.1 Intel ...... 19
3.2.2 ATI (AMD) ...... 21
3.2.3 NVIDIA ...... 22
3.2.4 Theoretical Processing Performance ...... 23
3.2.5 Cost ...... 24
3.2.6 Thermal Dissipation ...... 24
3.3 Field Programmable Gate Array ...... 25
3.3.1 Theoretical Processing Performance ...... 28
3.3.2 Cost ...... 28
3.3.3 Thermal Dissipation ...... 30
3.3.4 Hard Copy ...... 31
3.4 Application Specific Integrated Circuit ASIC ...... 31
3.4.1 Process Size ...... 31
3.4.2 Masking Costs ...... 35
3.4.3 Yield and Die Costs ...... 35
3.4.4 Prototyping ...... 37
3.5 Gap between FPGAs and ASICs ...... 37
3.5.1 Theoretical Processing Performance ...... 38
3.5.2 Cost ...... 38
3.5.3 Thermal Dissipation ...... 39
3.6 Network on Chip, NoC ...... 39

4 STORAGE ...... 42
4.1 SRAM ...... 45
4.1.1 SRAM performance ...... 46
4.1.2 SRAM Thermal Dissipation ...... 46
4.1.3 SRAM Cost ...... 46
4.2 Dynamic Random Access Memory, DRAM ...... 46


4.2.1 DRAM Performance ...... 47
4.2.2 DRAM Cost ...... 48
4.2.3 DRAM Thermal Dissipation ...... 49
4.3 NAND Flash ...... 50
4.3.1 NAND Cost ...... 52
4.3.2 NAND Thermal Dissipation ...... 52
4.4 Storage Class Memory ...... 52
4.4.1 SCM Performance ...... 53
4.4.2 SCM Cost ...... 54
4.4.3 SCM Thermal Dissipation ...... 54

5 DISK STORAGE ...... 54
5.1.1 Disk Performance ...... 55
5.1.2 Disk Thermal Dissipation ...... 56
5.1.3 Disk Cost ...... 56

6 NETWORK ...... 57
6.1 Infiniband ...... 57
6.1.1 Infiniband Performance Roadmap ...... 57
6.1.2 Host Channel Adapters ...... 58
6.1.3 Infiniband switches ...... 58
6.2 Ethernet ...... 59
6.2.1 100 G bit/s Ethernet Switches ...... 60
6.2.2 Terabit Ethernet ...... 60
6.2.3 Ethernet Cost ...... 60
6.2.4 Ethernet Thermal Dissipation ...... 60
6.3 Optical Interconnect ...... 61
6.3.1 Performance ...... 63
6.3.2 Thermal Dissipation ...... 64
6.3.3 Cost ...... 64

7 APPENDIX 1 ...... 64
7.1 Moore’s Law ...... 64
7.2 Transistor Size ...... 66
7.3 Breaking Moore’s Law ...... 67
7.4 Moore’s Law and Processing Capability ...... 67

8 APPENDIX 2 ...... 68
8.1 Tilera ...... 68
8.2 Clearspeed ...... 69
8.3 PicoChip ...... 70
8.4 Other Technologies ...... 71


LIST OF FIGURES

Figure 1 Computations per kilowatt hour over time ...... 15
Figure 2 Intel’s Tick Tock Roadmap ...... 16
Figure 3 Parallel speed up ...... 18
Figure 4 Intel’s Science Computing Road‐Map ...... 20
Figure 5 Intel Roadmap ...... 20
Figure 6 ATI Graphics accelerator with 8 GPU cards ...... 21
Figure 7 NVIDIA GPU Historic Roadmap ...... 22
Figure 8 NVIDIA Tesla S2050 unit plan view ...... 23
Figure 9 Tesla S2050 Architecture ...... 23
Figure 10 CUDA GPU Processing power per Watt Road‐map ...... 24
Figure 11 Gates per unit area of silicon as a function of process size ...... 32
Figure 12 IBM ASIC Gate Delays ...... 32
Figure 13 IBM ASIC Dynamic Power ...... 33
Figure 14 IBM ASIC Static Power ...... 34
Figure 15 Total chip dynamic and static power dissipation trends ...... 34
Figure 16 Mask Tooling Costs ...... 35
Figure 17 Example NoC and processing Tile ...... 40
Figure 18 Silicon Implementation ...... 40
Figure 19 Artist’s concept of 3D silicon processor chip with optical IO layer featuring on‐chip nanophotonic network ...... 42
Figure 20 Storage taxonomy ...... 43
Figure 21 Storage Hierarchy ...... 44
Figure 22 Samsung’s Memory Technology and Solutions Roadmap ...... 44
Figure 23 Samsung’s DRAM Historic Roadmap ...... 47
Figure 24 Samsung’s DRAM Historic Roadmap ...... 47
Figure 25 Samsung DDR DRAM Performance Roadmap ...... 48
Figure 26 DRAM Chip Selling Price December 2010 ...... 49
Figure 27 Samsung DRAM: Measured Thermal Dissipation ...... 49
Figure 28 NAND and NOR Flash Memory Schematics and Cell layout ...... 50
Figure 29 Intel Micron Historic Flash Roadmap ...... 51
Figure 30 NAND Cost per M Byte Road Map ...... 52
Figure 31 SCM Roadmap in relation to NAND, DRAM and Hard Disk (HDD) ...... 54
Figure 32 Historic Roadmap for Disk Areal Density ...... 55
Figure 33 Historic Roadmap for Disk Bandwidth ...... 55
Figure 34 Infiniband Roadmap ...... 58
Figure 35 Ethernet PHY standards ...... 59
Figure 36 Alcatel Lucent Power Consumption Roadmap ...... 61
Figure 37 CFP Hardware Specification Power Interlock ...... 61
Figure 38 IBM Terrabus Overview ...... 62
Figure 39 IBM Terrabus Integrated Circuit Connectivity ...... 63
Figure 40 IBM Terrabus Integrated Circuit and Printed Circuit board Optical Connectivity ...... 63


Figure 41 Numbers of Transistors for Intel Processors ...... 64
Figure 42 ITRS transistor cost predictions ...... 65
Figure 43 Roadmap of Transistor Size ...... 66
Figure 44 Physical Scaling of Parameters for a Semi‐conductor gate ...... 67
Figure 45 Tilera Tile Processor architecture ...... 69
Figure 46 Clearspeed’s CSX 700 ...... 70
Figure 47 PicoChip’s Pico Array Architecture ...... 71

LIST OF TABLES

Table 1 Technology readiness levels as risk likelihood indicators ...... 10
Table 2 Technology Readiness Level Definitions ...... 11
Table 3 Intel’s Tick Tock Time Line ...... 16
Table 4 Current Virtex 6 product range ...... 26
Table 5 Xilinx Next Generation FPGA (Virtex 7) ...... 27
Table 6 Xilinx pricing on 29th December 2010 for Virtex 6 Devices ...... 29
Table 7 FPGA to ASIC Gap Summary ...... 37
Table 8 NoC Packet transmission Energies ...... 41
Table 9 Current Baseline and Prototypical Memory Technologies (ITRS 2007) ...... 45
Table 10 parameter growth ...... 65
Table 11 Device Scaling factors ...... 66


LIST OF ABBREVIATIONS

AA ...... Aperture Array
Ant ...... Antenna
API ...... Application Programming Interface
ASIC ...... Application Specific Integrated Circuit
BER ...... Bit Error Rate
CAD ...... Computer Aided Design
CAGR ...... Compound Annual Growth Rate
CoDR ...... Conceptual Design Review
COTS ...... Commercial off the Shelf
cm ...... centimetre
CPU ...... Central Processing Unit
DDR ...... Double Data Rate
DOD ...... Department of Defence
DRAM ...... Dynamic Random Access Memory
DRM ...... Design Reference Mission
DSP ...... Digital Signal Processor
EDA ...... Electronic Design Automation
EoR ...... Epoch of Reionisation
EX ...... Example
FFT ...... Fast Fourier Transform
FLOPS ...... Floating Point Operations per Second
FoV ...... Field of View
FPGA ...... Field Programmable Gate Array
GPU ...... Graphics Processing Unit
HCA ...... Host Channel Adapter
HDD ...... Hard Disk Drive
HDL ...... Hardware Description Language
HDR ...... High Data Rate
Hz ...... Hertz
IDR ...... Internal Data Rate
IFFT ...... Inverse Fast Fourier Transform
I/O ...... Input/Output
IP ...... Intellectual Property
K ...... Kelvin
LNA ...... Low Noise Amplifier
MAC ...... Multiply Accumulate
MLM ...... Multi-Layer Mask


MMF ...... Multi Mode Fibre
MPW ...... Multi-Project Wafer
MW ...... Mega Watt
nm ...... nanometre
NoC ...... Network on Chip
NDA ...... Non Disclosure Agreement
NDR ...... Next Data Rate
NRE ...... Non Recurring Engineering
Ny ...... Nyquist
OH ...... Over Head
ONoC ...... Optical Network on Chip
OS ...... Operating System
OTPF ...... Observing Time Performance Factor
Ov ...... Over sampling
PAF ...... Phased Array Feed
PCI ...... Peripheral Component Interconnect
PCIe ...... PCI Express
PrepSKA ...... Preparatory Phase for the SKA
Rd ...... read
RFI ...... Radio Frequency Interference
rms ...... root mean square
RRAM ...... Resistive Random Access Memory
SCM ...... Storage Class Memory
SEFD ...... System Equivalent Flux Density
SER ...... Soft Error Rate
SKA ...... Square Kilometre Array
SKA1 ...... SKA Phase 1
SKA2 ...... SKA Phase 2
SKADS ...... SKA Design Studies
SMF ...... Single Mode Fibre
SPDO ...... SKA Program Development Office
SRAM ...... Static Random Access Memory
SSD ...... Solid State Drive
SSFoM ...... Survey Speed Figure of Merit
TBD ...... To be decided
TRL ...... Technology Readiness Level
Wr ...... write
Wrt ...... with respect to


1 Introduction

The aim of this document is to provide an overview of the technology that could potentially form the basis of the signal processing for the SKA telescope. It is intended that this document should be reviewed and updated on an annual basis leading up to phase 1 and phase 2 of the telescope, to provide an up to date perspective as input to the technology selection process. This is intended to be a complementary activity abstracted from specific Concept Designs. Consequently, the document focus is the technology options and their attributes rather than design details. It is intended that the document should provide a wide coverage of technology; however, the level of detail provided on specific technologies will be proportional to the perceived relevance of the technology at the time of writing.

One limitation of this document is that its scope is restricted to information available in the public domain. For obvious reasons, commercial manufacturers tend to be quite guarded about their specific road maps and may only release details under Non Disclosure Agreements, NDAs. However, this is not considered a major limitation in providing a reasonable overview for a technology roadmap, particularly one that is to be updated on an annual basis.

This document is part of a series generated in support of the Signal Processing CoDR which includes the following:

 Signal Processing High Level Description

 Technology Roadmap

 Design Concept Descriptions

 Signal Processing Requirements

 Signal Processing Costs

 Signal Processing Risk Register

 Signal Processing Strategy to Proceed to the Next Phase

 Signal Processing CoDR Review Plan

 Software & Firmware Strategy

1.1 Purpose of the document

The overall purpose of this document is to identify the road map of processing and communication technology applicable to the SKA signal processing. This is to include:

 Identify known potential technologies applicable to the SKA

 Where possible project attributes of known technology to the time frame of the SKA in terms of:

o Performance

o Cost


o Thermal Dissipation

 Provide an overview of potential future technologies that may be applicable to the SKA within the time frame of the SKA1 or SKA2.

 List ‘also ran’ technologies that have been considered but judged unsuitable in their current form

1.2 Technology Readiness Levels

For a document detailing a technology roadmap, the issue of technology readiness needs to be raised. The Risk Management Plan (MGT‐040.040.000‐MP‐001, issue 1) proposes that a condensed version of the United States Department of Defence (DOD) and NASA technology readiness levels (TRL) be used to estimate the likelihood of occurrence for the relevant technology; these are shown in Table 1.

Table 1 Technology readiness levels as risk likelihood indicators

It is important to note that the technology readiness may differ from one hierarchical level to the next. For example ‐ individual components may be freely available implying that the risk for procurement at the component level is low. However, if these components have not yet been integrated and shown to fulfil the required functions in the required environment at the next hierarchical level, the risk at this higher level will be high.

The definitions of the technology readiness levels are shown in Table 2. These definitions should be taken into account along with the risk likelihood level when using the roadmap to inform any concept implementation.


Table 2 Technology Readiness Level Definitions


2 References

[1] International Technology Roadmap for Semiconductors (ITRS), available at www.itrs.net.
[2] Terrabus: a Chip‐to‐Chip Parallel Optical Interconnect, J. A. Kash et al.
[3] Progress in Digital Integrated Electronics, G. Moore, Technical Digest, IEEE Int’l Electron Devices Meeting, Vol 21, 1975, pp 11‐13.
[4] Establishing Moore’s Law, Ethan Mollick, IEEE Annals of the History of Computing, Vol 28, No. 3, 2006, pp 62‐75.
[5] Three Steps to the Thermal Noise Death of Moore’s Law, Jacek Izydorczyk, IEEE Trans. VLSI Systems, Vol 18, No. 1, 2010, pp 161‐165.
[6] Limits to Binary Logic Switch Scaling—A Gedanken Model, Victor V. Zhirnov et al., Proc. IEEE, Vol 91, No. 11, 2003, pp 1934‐1939.
[7] Limits on Silicon Nanoelectronics for Terascale Integration, J. D. Meindl, Science, Vol 293.
[8] Microprocessor Scaling: What Limits Will Hold?, Jacek Izydorczyk, IEEE Computer, Aug 2010.
[9] Emerging Research Memory and Logic Technology, J. A. Hutchby et al., IEEE Circuits & Devices Magazine, Vol 21, No. 3, 2005, pp 47‐51.
[10] Future Trends in Microelectronics, S. Luryi, J. Xu & A. Zaslavsky, John Wiley & Sons.
[11] The High‑k Solution, M. T. Bohr, R. Chau & K. Mistry, IEEE Spectrum, Vol 44, No. 10, 2007, pp 29‐35.
[12] Quantifying and Exploring the Gap Between FPGAs and ASICs, Ian Kuon & Jonathan Rose, Springer.
[13] Explaining the gap between ASIC and custom power: a custom perspective, A. Chang, W. J. Dally, DAC ’05, Proceedings of the 42nd Annual Conference on Design Automation, pp 281‐284, ACM, New York, 2005.
[14] Closing the Gap Between ASIC & Custom: Tools and Techniques for High‐Performance ASIC Design, D. Chinnery, K. Keutzer, Kluwer, New York, 2002.
[15] Closing the Power Gap Between ASIC and Custom: an ASIC perspective, DAC ’05, Proceedings of the 42nd Annual Conference on Design Automation, pp 275‐280, ACM, New York, 2005.
[16] The role of custom design in ASIC chips, DAC ’00, Proceedings of the 37th Annual Conference on Design Automation, pp 643‐647, ACM, New York, 2000.
[17] Assessing Trends in the Electrical Efficiency of Computation Over Time, J. G. Koomey, report to Microsoft and Intel Corporations.
[18] Computer Architecture: A Quantitative Approach, Hennessy and Patterson.
[19] A 51mW 1.6 GHz on‐chip network for low‐power heterogeneous SoC platform, Kangmin Lee et al., IEEE Int. Solid‐State Circuits Conference, Digest of Technical Papers, pp 152‐512, Feb 2004.
[20] An 800MHz star‐connected on‐chip network for application to systems on a chip, Se‐Joong Lee et al., IEEE Int. Solid‐State Circuits Conf., Digest of Technical Papers, pp 468‐469, Feb 2003.
[21] Low‐Power NoC for High‐Performance SoC Design, Hoi‐Jun Yoo, Kangmin Lee, Jun Kyoung Kim, CRC Press, 2008; and An 80‐Tile 1.28 TFLOPS Network‐on‐Chip in 65nm CMOS, S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, N. Borkar, ISSCC 2007, Session 5.2 (Microprocessors).


3 Processing

The SKA signal processing has some onerous processing and signal transport requirements due to its sheer scale, whilst being constrained by cost and thermal dissipation.

Of the potential solutions, four processing technologies are currently popular with the astronomy engineering community and potentially offer solutions within the timeframe of the SKA:

 General Purpose Processor

 GPU

 FPGA

 ASIC

However, there are other interesting developments that aren’t in the mainstream but could potentially pave the way to a solution. Appendix 2 details some of these options.

3.1 General Purpose Processor

The term general purpose processor is nominally used to identify x86 architecture processors manufactured by Intel and AMD, which are typically programmed in a high level language. Other processors also fall into this category, such as Motorola’s vector processing parts and Sun’s Niagara. Each of these processors is aimed at providing a highly flexible programming platform coupled to a supporting Operating System, OS. One cost of providing this general purpose capability is reduced power efficiency, since the platform requires extra hardware to support the inbuilt flexibility. For example, the processing unit will typically be 32 or 64 bit floating point irrespective of the data word length. A metric typically used to indicate the processing efficiency is processing capability per kilowatt, kW.

Figure 1 details the roadmap of the theoretical processing capability per kWh of dissipation for general purpose computers over the period 1945 through to 2010. Projecting from this graph suggests 2.7 × 10¹⁶ computations per kWh by 2015, or alternatively 7.5 × 10¹⁵ computations per second per megawatt of dissipation. An industry target of ~20 MW exists for Exascale computing by 2020. This can be shown to be consistent with projections from Figure 1.


(J G. Koomey Stanford)[17]

Figure 1 Computations per kilowatt hour over time
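As a cross‑check of the figures quoted above, the following minimal Python sketch converts the projected computations per kWh into computations per second per MW and estimates the power an exascale machine would need at that efficiency. The 10¹⁸ operations/s definition of exascale and the ~1.5 year efficiency doubling period used for the 2020 extrapolation are assumptions, not figures from this document.

```python
# Consistency check of the efficiency figures quoted above (illustrative sketch only).
comp_per_kwh_2015 = 2.7e16          # projected computations per kWh in 2015 (from Figure 1)
seconds_per_hour = 3600.0

# Computations per second per kW of dissipation, then per MW.
comp_per_sec_per_kw = comp_per_kwh_2015 / seconds_per_hour
comp_per_sec_per_mw = comp_per_sec_per_kw * 1000.0
print(f"~{comp_per_sec_per_mw:.1e} computations/s per MW in 2015")   # ~7.5e15

# Power needed for an assumed exascale (1e18 operations/s) machine at the 2015
# efficiency, and after a further assumed doubling roughly every 1.5 years to 2020.
exa_ops = 1e18
power_2015_mw = exa_ops / comp_per_sec_per_mw        # ~130 MW
power_2020_mw = power_2015_mw / 2 ** (5 / 1.5)       # ~13 MW, of order the 20 MW target
print(f"Exascale power: ~{power_2015_mw:.0f} MW (2015 efficiency), "
      f"~{power_2020_mw:.0f} MW (2020 extrapolation)")
```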

At present (October 2010) Intel processor chips dominate the Top 500 supercomputers with over 80% of processors being Intel. On this basis, the roadmap of Intel processors is presented as being representational of the roadmap for x86 architecture general purpose processors. The information presented is in the public domain and has largely been harvested from the Internet including Intel’s own web‐site.

Intel’s strategy for processor developments is based on a time line known as ‘the Tick Tock roadmap’ and is detailed in Figure 2. The Tick of the time line represents a process change and the Tock represents a processor architecture change. The current technology is at a 45nm process with the Nehalem architecture. The top end performance of the 45nm technology is likely to be achieved with the ‘Beckton’ Xeon processor which should provide 8 processor cores running at up to 2.3 GHz for 130 Watts processor dissipation and at a unit price of $3.7k.


Figure 2 Intel’s Tick Tock Roadmap

        Architecture Change                  Fabrication Process   Release Date   Energy Scaling   Delay Scaling

Tick    Shrink/derivative (Penryn)           45nm                  2008           0.5              >0.7
Tock    Microarchitecture (Nehalem)                                2009
Tick    Shrink/derivative (Westmere)         32nm                  2010           0.5              >0.7
Tock    Microarchitecture (Sandy Bridge)                           2011           0.5              >0.7
Tick    Shrink/derivative (Ivy Bridge)       22nm                  2012           0.5              >0.7
Tock    Microarchitecture (Haswell)                                2013           0.5              >0.7
Tick    Shrink/derivative (Rockwell)         16nm                  2014           0.5              ~1
Tock    Microarchitecture (TBD)                                    2015           0.5              ~1

Table 3 Intel’s Tick Tock Time Line

Table 3 summarises Intel’s tick‐tock roadmap process through to the 16nm process. Intel also has some more speculative projections through to 4nm technology by 2022.

These figures suggest that there should be a factor of two improvement in thermal dissipation for the same processing capability for each die shrink. To achieve this presents some technical challenges as leakage current becomes more of a problem as feature size is reduced. A discussion of this issue is provided later in the document as it is applicable to other processing technologies too.


Another major architectural limitation is the thermal density achievable by the processor chip’s packaging, which is currently of the order of 140 W per cm² for a commercial 2 dimensional device. It is this limitation that has recently brought a halt to ever increasing processor clock rates and driven the architecture down the path of multi‐core processing. The use of three dimensional packaging can provide a one‐off step improvement in the achievable thermal density.

3.1.1 Theoretical Processing Performance

Typically, the theoretical maximum processing power, in G FLOPS, offered by a single general purpose (x86) processor is:

GFLOPS_peak = Number of cores × Clock rate [GHz] × Floating point operations per clock cycle

The “Sandy Bridge” technology refresh is due for 2011 and there are already provisional figures available for processor chips such as the Core i7‐2600K aimed at desktop applications. This is a quad core device clocked at up to 3.8 GHz. Consequently:

GFLOPS_peak = 4 × 3.8 × 2 ≈ 30.4

From Table 3, it is expected that there will be four further generations of processor by the year 2015, with the theoretical processing power speculatively increasing by a factor of 2⁴ = 16:

GFLOPS_peak (2015) ≈ 30.4 × 16 ≈ 490
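A minimal sketch of the peak throughput arithmetic above. The core count, clock rate and the 2 floating point operations per cycle are the figures quoted in the text; the factor‑of‑two‑per‑generation scaling is the speculative assumption stated above.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak throughput of a processor in GFLOPS."""
    return cores * clock_ghz * flops_per_cycle

# Core i7-2600K figures quoted in the text.
gflops_2011 = peak_gflops(cores=4, clock_ghz=3.8, flops_per_cycle=2)   # ~30.4 GFLOPS

# Speculative 2015 projection: four further generations, each doubling throughput.
gflops_2015 = gflops_2011 * 2 ** 4                                     # ~490 GFLOPS
print(f"2011: {gflops_2011:.1f} GFLOPS, 2015 projection: {gflops_2015:.0f} GFLOPS")
```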

3.1.2 Cost

The “Sandy Bridge” technology refresh is due for 2011 and there are already provisional figures available for processor chips such as the Core i7‐2600K aimed at desktop applications. The chip it is due to replace is the 3.4 GHz i5‐2600, which currently sells for ~$300. Intel generally introduces new CPUs at ~$10–20 above the parts they replace.

It is expected that a current generation processor chip will cost a similar amount in 2015.

3.1.3 Thermal Dissipation

Table 3 provides details of the expected energy scaling for Intel chips, with a doubling of processing power for the same thermal dissipation at each technology generation. The Core i7‐2600K referred to above is a quad core device that can be clocked at up to 3.8 GHz and is expected to dissipate 95W.

It should be pointed out that the thermal dissipation depends on the processing load, with 95W being dissipated at 100% loading for the processor chip only. The dissipation at idle (0% processing load) will be reasonably high and could possibly be as high as 30 to 40 Watts. External memory and interface electronics will also add to the thermal dissipation for a computing node.

Higher performance “server” grade processor chips are expected to dissipate ~ 130W

It is expected that the thermal dissipation of a current generation processor chip will be at a similar level in 2015.


3.1.4 Scalability

To provide high levels of computing power, many general purpose processors may be run in parallel. A naive assumption would be that the achievable processing power scales linearly with the number of processors utilised. However, this is not the case, as can be seen in Figure 3.

Figure 3 Parallel speed up

In this figure, the speed up for several arbitrary applications has been measured as a function of the number of processor cores used to provide the processing for the application. The measurements are for processors on separate chips rather than multiple cores on a chip. As can be seen, the results vary depending on the application. Some applications see little speed up beyond 32 cores. The effect isn’t as pronounced for multiple cores on the same chip, but Figure 3 is useful to illustrate the phenomenon of diminishing returns that can be attributed to Amdahl’s Law:

The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Speedup = 1 / ((1 − f) + f / n)

Where n is the number of processors, and f is the fraction of computation that programmers can parallelize (0 ≤ f ≤ 1). An article that applies this principle to evaluate potential architectures of multi‐core processors is:

Extending Amdahl’s Law for Energy‐Efficient Computing in the Many‐Core Era; Dong Hyuk Woo and Hsien‐Hsin S. Lee, Georgia Institute of Technology: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4712496
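The diminishing returns described above can be made concrete with a short sketch of Amdahl’s law. This is a generic illustration rather than a model of any application in Figure 3; the parallel fractions chosen below are arbitrary.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Amdahl's law: speedup on n processors when a fraction f of the work parallelises."""
    return 1.0 / ((1.0 - f) + f / n)

# Even a highly parallel code (f = 0.95) saturates quickly as cores are added.
core_counts = (8, 32, 128, 1024)
for f in (0.50, 0.90, 0.95, 0.99):
    speedups = [amdahl_speedup(f, n) for n in core_counts]
    print(f"f = {f:4.2f}: " + ", ".join(f"{n:>4} cores -> {s:6.1f}x"
                                        for n, s in zip(core_counts, speedups)))
```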


3.2 Graphics Processing Unit

A graphics processing unit, GPU, is a specialized processor that offloads and accelerates graphics rendering from the central processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general‐purpose CPUs for a range of simple algorithms. Because most of these computations involve matrix and vector operations, the GPU has, over the last few years, been adapted for use as a processing accelerator, particularly within the engineering and science domains. For example, the current fastest supercomputer in the top 500 (http://www.top500.org/system/10587) is Tianhe‐I in Tianjin, China, which achieves 2.566 Peta FLOPS with the aid of GPU accelerators. This computer cost $88M to build and $20M per annum in energy and maintenance costs. The architecture is based on compute nodes containing two Xeon X5670 6‐core processors and one Nvidia Tesla M2050 GPU processor. The system in total contains 7168 GPUs and 14,336 CPUs.

Programming GPUs can be problematic. Although NVIDIA and ATI have endeavoured to provide programming environments and library sets through programming languages such as the vendor specific CUDA and, more recently, OpenCL (http://www.khronos.org/opencl/), these are largely tied in to GPU processing. CUDA (Compute Unified Device Architecture) provides an API extension to the C programming language, which allows specified functions from a normal C program to run on the GPU's stream processors. This makes C programs capable of taking advantage of a GPU's ability to operate on large matrices in parallel, while still making use of the CPU when appropriate. CUDA is also the first API to allow CPU‐based applications to access directly the resources of a GPU for more general purpose computing without the limitations of using a graphics API. OpenCL is an open standard managed by the Khronos Group, with participation from vendors including ATI and NVIDIA, and claims to be “an open, royalty‐free standard for cross‐platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices.”

In 2008, Intel, NVIDIA and AMD/ATI were the market share leaders, with 49.4%, 27.8% and 20.6% share respectively. However, those numbers include Intel's integrated graphics solutions as GPUs. Excluding those, NVIDIA and ATI control nearly 100% of the discrete GPU market. The following sections provide a roadmap for GPU products from these three companies.

3.2.1 Intel

Intel has presented an ambitious road‐map identifying science computing requirements to the year 2029 (Figure 4), taken from: http://download.intel.com/pressroom/archive/reference/ISC_2010_Skaugen_keynote.pdf


Figure 4 Intel’s Science Computing Road‐Map

In support of the feasibility of this road map, details of production, development and research associated with achieving the time lines have been presented and are detailed in Figure 5.

Figure 5 Intel Roadmap

This roadmap includes a 22nm “Many Integrated Core” Processor derived from Intel’s cancelled project for a General Purpose GPU chip known as Larrabee. This processor is compatible with the standard Intel Architecture programming and memory model which eliminates the need for a dual programming architecture currently required for NVIDIA and ATI GPUs and is compatible with existing C, C++, and Fortran compilers for the Intel Xeon.

The initial implementation will be a 32 core device clocking at 1.2 GHz with 8 M Bytes of shared coherent cache, in the form of a software development platform code‑named “Knights Bridge”, which was demonstrated in 2010.

“Knights Corner” is the next generation and will offer over 50 processor cores and is expected to be available in Q3/4 of 2011 with further “Knights” products leveraging Moore’s Law.


3.2.2 ATI (AMD)

As mentioned previously, ATI and NVIDIA dominate the non embedded graphics market in terms of sales. However, currently, ATI don’t seem to be marketing the use of their GPU products for HPC as hard as NVIDIA are with their products. For example, there is a fairly limited amount of information on the ATI website:

http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/STREAM‐TECHNOLOGY/Pages/stream‐technology.aspx

The current top of the range GPU card aimed at streamed processing is the AMD FireStream 9170, described at:

http://www.amd.com/us/Documents/AMD_FS9170_051908.pdf

This graphics card contains 800 55nm processor cores providing 1.2 Tera Flops of single precision or 240 G Flops double precision processing capability.

The thermal dissipation for the card, including memory and other support hardware, is claimed to be 160 Watts typical and < 220 Watts peak. AMD claim 4 G FLOPS per Watt capability though it isn’t clear whether this is for just the GPU processor chip.

The memory interface on the graphics card is 256 bits wide clocking at 800 MHz, which provides 110 G Bytes/s capability.

The GPU card needs to be supported by a host server (Figure 6) to provide the I/O interface, which is 16 lanes of second generation PCI Express. Each lane of PCI Express 2.1 is a serial link running 8b/10b encoded at 5 G bits/s, meaning the theoretical I/O payload bandwidth is 64 G bit/s. PCI Express v3.0 was ratified in November 2010 and includes on‑the‑wire bit rates of 8 G bit/s. If multiple GPUs are used in the same host this bandwidth will be limited by the PCI Express root complex within the server. In addition, aspects of the server architecture will also impact the achievable data rate and may make the GPU I/O bound in terms of its processing power.

Figure 6 ATI Graphics accelerator with 8 GPU cards


At present the roadmap for AMD isn’t well publicised by AMD on their web‐site, however, details of the AMD Firestream 9370 have been tracked down:

http://www.amd.com/us/press‐releases/pages/firestream‐peak‐performance‐2010june23.aspx

This press release provides “planned” specifications for the AMD FireStream 9370. It is claimed it will deliver a theoretical 2.64 TFLOPS of single precision performance and 528 GFLOPS of double precision performance for a maximum board dissipation of 225 watts. The release date should have been Q4 2010; however, a search of the Internet in early January 2011 could not locate a unit for sale. The suggested price is ~ $2k.

Several AMD technology partners and OEMs plan to offer rack mounted servers and expansion systems featuring AMD FireStream 9350 and 9370 accelerators, including:

One Stop Systems: http://www.onestopsystems.com/

Supermicro: http://www.supermicro.com/index.cfm

3.2.3 NVIDIA

NVIDIA arguably have the strongest presence in the GPU streamed processing market via their well established but proprietary Compute Unified Device Architecture (CUDA). Figure 7 provides an overview of how this architecture has developed over the last few generations leading up to the current Fermi product, which offers 512 processing cores per chip, providing 512 single precision operations per clock.

Figure 7 NVIDIA GPU Historic Roadmap


In addition to providing GPU processing chips and cards, NVIDIA are also providing support hardware for large scale installations in the form of the Tesla S2050 1 U Computing system:

http://www.nvidia.com/object/product‐tesla‐S2050‐us.html

Figure 8 NVIDIA Tesla S2050 unit plan view

As illustrated in Figure 9, the S2050 can host up to 4 GPU processing units and provides the required power supplies and thermal management. Communication to the GPUs is via NVIDIA PCIe switches incorporated in the chassis.

Figure 9 Tesla S2050 Architecture

The S2050 still requires a host system and communicates to it via PCI‐express cables.

3.2.4 Theoretical Processing Performance

Typically, the theoretical maximum processing power, in G FLOPS, offered by a single NVIDIA GPU processor is:


GFLOPS_peak = Number of GPUs × Cores per GPU × Clock rate [GHz] × Floating point operations per core per clock cycle

A “Tesla” GPU platform comprises 4 Fermi GPUs, each with 512 cores clocking at up to 1.5 GHz. Consequently:

GFLOPS_peak = 4 × 512 × 1.5 × 2 ≈ 6,100 (≈ 6 TFLOPS)

In 2009, William J. Dally, the Chief Scientist of the NVIDIA Corporation, delivered a keynote address to the Design Automation Conference predicting the roadmap for NVIDIA graphics processors. This stated that graphics processors will have thousands of cores by 2015, implemented on 11 nm process technology. In particular, they will feature roughly 5,000 cores and provide up to 20 teraflops of performance:

TFLOPS_peak (2015) ≈ 20
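The same peak throughput arithmetic applied to the GPU figures above, as a sketch. The 2 FLOPs per core per clock reflects the fused multiply‑add assumed in the reconstruction of the equation above, and the 2 GHz clock for 2015 is an assumption chosen only to make the 5,000 core and 20 TFLOPS figures consistent.

```python
def gpu_peak_tflops(gpus: int, cores_per_gpu: int, clock_ghz: float,
                    flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak single-precision throughput in TFLOPS (FMA counted as 2 FLOPs)."""
    return gpus * cores_per_gpu * clock_ghz * flops_per_core_per_cycle / 1e3

# Tesla S2050-style unit: 4 Fermi GPUs, 512 cores each, clocking at up to 1.5 GHz.
tflops_2011 = gpu_peak_tflops(gpus=4, cores_per_gpu=512, clock_ghz=1.5)     # ~6 TFLOPS

# 2015 projection quoted above: ~5,000 cores per GPU delivering ~20 TFLOPS.
# A ~2 GHz clock is the assumption needed to reconcile those two figures.
tflops_2015 = gpu_peak_tflops(gpus=1, cores_per_gpu=5000, clock_ghz=2.0)    # ~20 TFLOPS
print(f"S2050 unit: ~{tflops_2011:.1f} TFLOPS, 2015 single GPU: ~{tflops_2015:.0f} TFLOPS")
```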

3.2.5 Cost

The “Tesla” platform is currently available with “Fermi” GPU technology. The cost of a 1U S2050 housing containing 4 Fermi GPUs, providing 2048 processing cores and 2 off PCIe 16x interfaces, is of the order of $12k:

http://www.morecomputers.com/extra.asp?pn=tcss2050‐1/2mx16‐pb&referer=FroogleA

It is expected that each generation GPU processing platform will cost a similar amount.

3.2.6 Thermal Dissipation

Figure 10 CUDA GPU Processing power per Watt Road‐map

Figure 10 provides details of the expected processing performance per Watt scaling for NVIDIA CUDA GPU family for each technology generation up to 2013. It is assumed a further generation will be available for 2015.


It should be pointed out that the thermal dissipation depends on the processing load with the maximum being associated with 100%. On current generations of NVIDIA GPU it has been observed that the thermal dissipation at idle (0% processing load) is high too. The CUDA platform also has to be associated with a host server which will also contribute to the thermal dissipation which may be of the order of 200 to 300 Watts. However, up to 8 GPUs may be hosted by the same server though these will have to share the PCIe interface bandwidth.

Each GPU rack housing the 8 GPUs is expected to dissipate up to ~ 900W

It is expected that the thermal dissipation of a current generation graphics cards will be at a similar level in 2015.

3.3 Field Programmable Gate Array

Field Programmable Gate Arrays, FPGAs, have been around since 1985 when Ross Freeman and Bernard Vonderschmitt of Xilinx produced the first commercially viable FPGA, the XC2064. The FPGA is an integrated circuit designed to allow its hardware to be reconfigured via a Hardware Description Language. This is achieved by the use of "logic blocks" and a hierarchy of reconfigurable interconnects that allow the blocks to be patched together. These logic blocks can be configured to perform complex combinational functions, or merely simple logic gates like AND and XOR. In most FPGAs, the logic blocks also include memory elements, which may be simple flip‐flops or more complete blocks of memory.

Within the last few years some manufacturers have been supplementing the general purpose logic blocks with multiple embedded cores providing Digital Signal Processing, DSP, and high speed serial (multi Gigabit/s) transceivers, as well as control microprocessor cores. The DSP cores are typically fixed width (18 bit) multiply accumulators that can be linked to the surrounding logic blocks. For the SKA signal processing, 18 bit integer processing is as effective as 32 bit floating point processing. The serial transceivers can be configured to be compatible with the physical layer of commercial communication standards. These recent developments in FPGA architecture, coupled with their ability to be reconfigured, have made them a popular alternative to producing custom chip designs, as risks associated with the development life cycle are significantly reduced.

Manufacturers of FPGA devices include:

 Xilinx: http://www.xilinx.com/

 Achronix: http://www.achronix.com/

 Altera: http://www.altera.com/

 Actel: http://www.actel.com/

 Aeroflex: http://www.aeroflex.com/ams/pagesproduct/prods‐hirel‐fpga.cfm

 Atmel: http://www.atmel.com/products/fpga/default.asp


 Lattice Semiconductor: http://www.latticesemi.com/products/fpga/index.cfm

 Quicklogic: http://www.quicklogic.com/

 Tabula: http://www.tabula.com/

 SiliconBlue Technologies: http://www.siliconbluetech.com/

Of these, Xilinx and Altera dominate the market with nominally a 50% and 30% share of the overall market respectively and consequently FPGAs from these companies provide the main focus of this document. However, the smaller companies tend to specialise in niche capability that is worth keeping an eye on. For example, Aeroflex specialise in radiation hardened FPGA solutions, Tabula in ultra fast (GHz) reconfigurability facilitating time multiplexed logic, SiliconBlue Technologies in ultra low power, Achronix in optimised fabric and Actel in mixed signal applications.

For simplicity of this document, Xilinx are used to provide a reference for the type of capability currently available from FPGAs and for projection of capability in the future. A similar analysis could be applied to Altera resulting in similar conclusions. The current range from Xilinx is the Virtex 6 range (Table 4) with details of the Virtex 7 family announced but not available until 2011.

Table 4 Xilinx Current Virtex 6 product range


Table 5 Xilinx Next Generation FPGA (Virtex 7)


3.3.1 Theoretical Processing Performance

Assuming the maximum achievable for the Virtex 6 family of 600MHz applied to the SX475T component implies:

MACS_peak = 600 MHz × 2,016 DSP slices ≈ 1.2 T MACS

Note that a MAC is multiply and accumulate within the same clock cycle and as such is the equivalent to two Ops.

Similarly the smaller SX315T component is theoretically capable of delivering 800 G MACS from its 1,344 DSP slices. In both cases the Multiply Accumulate is assumed to be 18 bits wide.

The top of the range Virtex 7 devices support 3960 DSP slices and are likely to clock at up to speeds of 600MHz.

MACS_peak = 600 MHz × 3,960 DSP slices ≈ 2.4 T MACS

Based on the existing roadmap (Virtex5 Q2 2006, Virtex 6 Q2 2009 & Virtex 7 Q2 2012) of 3 years per FPGA generation, one further generation of FPGA (beyond Virtex 7) is expected in the time scale 2015/2016. Based on the existing road map this is expected to double the processing capability to 4.8 T MACS (for 18 bit data).
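A sketch of the DSP slice arithmetic above. The slice counts, the 600 MHz clock and the generation‑on‑generation doubling are the figures and extrapolation quoted in the text.

```python
def peak_tmacs(dsp_slices: int, clock_mhz: float) -> float:
    """Theoretical peak multiply-accumulate rate in tera-MACs/s (one MAC per slice per clock)."""
    return dsp_slices * clock_mhz * 1e6 / 1e12

virtex6_sx475t = peak_tmacs(2016, 600)      # ~1.2 T MACS
virtex6_sx315t = peak_tmacs(1344, 600)      # ~0.8 T MACS
virtex7_top    = peak_tmacs(3960, 600)      # ~2.4 T MACS
virtex_next    = 2 * virtex7_top            # ~4.8 T MACS, assumed 2015/16 generation
print(f"V6 SX475T: {virtex6_sx475t:.1f} T MACS, SX315T: {virtex6_sx315t:.1f}, "
      f"V7 top end: {virtex7_top:.1f}, next generation (assumed): {virtex_next:.1f}")
```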

3.3.2 Cost

The cost of currently available Xilinx Virtex 6 FPGAs has been taken from the Avnet website on 29th December 2010 (hyperlinked from the Xilinx site):

http://www.xilinx.com/onlinestore/silicon/online_store_v6.htm

Device    Unit Cost $ (Qty: 1 off)    Unit Cost $ (Qty: 500 off)    Unit Cost $ (Qty: 1000+)    Notes

XC6VLX130T‐1FFG484 911.97 885.91 873.44

XC6VLX130T‐2FFG484 1,140.71 1,108.11 1,092.51

XC6VLX130T‐1FFG784 1050.1 1020.1 1005.73

XC6VLX130T‐2FFG784 1,311.51 1,274.04 1,256.10

XC6VLX195T‐1FFG1156 1620.59 1574.29 1552.11

XC6VLX195T‐2FFG1156 2,026.47 1,968.57 1,940.85


XC6VLX240T‐2FFG784 2184.87 2122.44 2092.55

XC6VLX240T‐1FFG1759 2,306.66 2,240.76 2,209.20

XC6VLX240T‐2FF1759 2,884.44 2,802.03 2,762.56

XC6VLX365T‐1FF1759C 4,002.94 3,888.57 3,833.80

XC6VLX365T‐2FF1759C 5,004.41 4,861.43 4,792.96

XC6VLX550T‐1FF1759C 5,336.76 5,184.29 5,111.27

XC6VLX550T‐2FF1759C 6,672.06 6,481.43 6,390.14

XC6VLX760‐1FFG1760C 15,622.06 15,175.71 14,961.97

XC6VLX760‐2FFG1760C 19,527.94 18,970.00 18,702.82

XC6VSX315T‐1FF1156C 3,245.59 3,152.86 3,108.45

XC6VSX315T‐2FF1156C 4,055.88 3,940.00 3,884.51

XC6VSX315T‐1FF1759C 3,732.35 3,625.71 3,574.65

XC6VSX315T‐2FFG1759C 4,664.71 4,531.43 4,467.61

XC6VSX475T‐1FF1156C 8,707.35 8,458.57 8,339.44

XC6VSX475T‐2FFG1156C 10,883.82 10,572.86 10,423.94

XC6VSX475T‐2FFG1759C 12,516.18 12,158.57 11,987.32

XC6VHX250T‐1FF1154 3980.88 3867.14 3812.68

XC6VHX250T‐2FF1154 4975.00 4832.86 4764.79

XC6VHX255T ‐ ‐ ‐ No pricing available

XC6VHX380T ‐ ‐ ‐ No pricing available

XC6VHX565T ‐ ‐ ‐ No pricing available

Table 6 Xilinx pricing on 29th December 2010 for Virtex 6 Devices

The table above provides a wide coverage of Xilinx’s component range, including different speed grades and packaging options. Of these, the devices supporting DSP functionality are probably of most interest for the SKA signal processing and are highlighted in the table. An interesting observation is that although the SX475T‐2 part provides 1.5 times the number of DSP slices of the SX315T‐2, its cost is roughly 3 times higher.

Pricing for the Virtex 7 series of devices is not yet available. However, it is considered a reasonable assumption that new generation devices will be at a similar level to the devices they are replacing.


It should be noted that contract negotiations with the manufacturer should be able to reduce these prices, as with the other technologies detailed in this document.

3.3.3 Thermal Dissipation

Quoting the thermal dissipation as a function of processing power for an FPGA is difficult as it is highly dependent on the implementation and layout of the device.

However, a rule of thumb figure that has been used within the astronomy community is 25 G MACS per Watt for the Virtex 6 technology. This figure needs some empirical justification:

ASKAP’s complete digitiser design has 356 multipliers operating at 384MHz and 303MHz giving a total of 110.784G multiplies for 11.3W or 9.8G multiplies/W. However this number includes a lot of power dissipated in RAM, IO and logic cells and is to some extent dependent on the implementation.

The power breakdown is:

• 1.12W for clocks

• 0.8W for Logic

• 1.56W for routing

• 2.19W for RAM

• 0.9W for Multipliers

• 0.3W for PLLs

• 1.1W for IO

• 1.7W for 3G Serial IO

• 1.7W for leakage

So just the multipliers on their own give a much better figure of 123 G multiplies/W.

The pre‐release documentation for the Virtex 7 details figures for the improvements over the Virtex 6 including:

 65% lower static power consumption

 25 to 30% lower dynamic power consumption

 30% lower I/O dynamic power consumption

Overall, it is expected the Virtex 7 should be able to provide twice the processing power for nominally the same thermal dissipation. If ASKAP’s empirical Virtex 6 data is representative, one might expect ~20 G MACS per Watt. Assuming top end performance of 2.4 T MACS per device, this translates to a thermal dissipation of ~ 120 Watts.
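A sketch reproducing the efficiency figures derived above from the ASKAP digitiser numbers, together with the Virtex 7 estimate. All inputs are the figures quoted in the text; the 20 G MACS per Watt and 2.4 T MACS values used for the Virtex 7 projection are the stated assumptions.

```python
# ASKAP digitiser figures quoted above.
g_multiplies_per_s = 110.784        # total multiply rate, G multiplies/s
total_power_w = 11.3                # whole-design power (clocks, logic, RAM, IO, leakage, ...)
multiplier_power_w = 0.9            # power attributed to the multipliers alone

whole_design_eff = g_multiplies_per_s / total_power_w          # ~9.8 G multiplies/W
multiplier_only_eff = g_multiplies_per_s / multiplier_power_w  # ~123 G multiplies/W

# Virtex 7 projection: ~20 G MACS/W assumed, top-end device ~2.4 T MACS.
assumed_v7_gmacs_per_w = 20.0
v7_tmacs = 2.4
v7_power_w = v7_tmacs * 1e3 / assumed_v7_gmacs_per_w           # ~120 W
print(f"{whole_design_eff:.1f} G mult/W (whole design), "
      f"{multiplier_only_eff:.0f} G mult/W (multipliers only), "
      f"Virtex 7 estimate ~{v7_power_w:.0f} W")
```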


3.3.4 Hard Copy

The Hard Copy process resides in the territory between FPGAs and ASICs. It allows an application developed in the FPGA domain to be hard coded into silicon. This has the advantages of significantly lowering device cost and reducing thermal dissipation. Of course, the programmable flexibility and the ability to reconfigure the device are lost.

Cost figures are not available at the time of writing this document.

Thermal dissipation is expected to be ~ 50% of the equivalent FPGA device. This might suggest 4.8 T MACS per device for ~ 25 Watts thermal dissipation.

3.4 Application Specific Integrated Circuit ASIC

An Application‐Specific Integrated Circuit (ASIC) is an integrated circuit designed specifically for a particular use, rather than a general‐purpose device. Typically the design is implemented at the transistor/ gate level or utilising the manufacturer’s libraries or third party Intellectual property for common functions. The benefits of full‐custom ASIC design usually include reduced silicon area (and therefore recurring component cost) and performance improvements including the ability to minimise thermal dissipation.

The disadvantages of full‐custom design can include increased manufacturing and design time, increased non‐recurring engineering (NRE) costs, more complexity in the computer‐aided design (CAD) system and a much higher skill requirement on the part of the design team.

However for digital‐only designs, "standard‐cell" cell libraries together with modern CAD systems can offer considerable performance/cost benefits with low risk. Automated layout tools are quick and easy to use and also offer the possibility to "hand‐tweak" or manually optimise any performance‐limiting aspect of the design.

Establishing the cost and performance of an ASIC solution is slightly more complicated than buying an off the shelf solution such as an FPGA or GPU as decisions have to be made about which process size to use. The following sections provide an overview of how this decision impacts on the cost of the solution through aspects such as Masking Costs, Yield and Packaging of the resultant silicon.

3.4.1 Process Size

The process size of an ASIC refers to the resolution of the mask lithography associated with the creation of each layer of the ASIC. This resolution determines the number of gates (and hence the logic design) that can be accommodated within an area of silicon, as illustrated in Figure 11. This graph is only an approximation as the packing density will depend on whether the device is auto routed or hand packed.


[Plot: number of gates per mm² (0–3500) against process size (0–700 nm)]

Figure 11 Gates per unit area of silicon as a function of process size

From this curve it is expected that a 45nm process will provide of the order of 800 thousand gates per square millimetre and a 22nm process 3.2 million gates per square millimetre. Taking an arbitrary existing device (the 45nm Core i7‐950), a sanity check can be performed. This device has 731 million transistors on a die size of 263 mm². Manipulating these numbers reveals the device has 2.8 million transistors per square millimetre. Typically a gate comprises 4 transistors, which gives a result of 700 thousand gates per square millimetre.
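The sanity check above as a minimal sketch; the transistor count, die area and the four‑transistors‑per‑gate rule of thumb are the figures quoted in the text.

```python
# Sanity check of Figure 11 against a known 45nm device (Core i7-950 figures from the text).
transistors = 731e6
die_area_mm2 = 263.0
transistors_per_gate = 4          # rule-of-thumb gate equivalent

transistor_density = transistors / die_area_mm2            # ~2.8 million transistors/mm^2
gate_density = transistor_density / transistors_per_gate   # ~700 thousand gates/mm^2
print(f"{transistor_density / 1e6:.1f} M transistors/mm^2, "
      f"{gate_density / 1e3:.0f} k gates/mm^2")
```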

The process size also determines the performance of the ASIC in terms of propagation delays and thermal dissipation. Data from an IBM product brief has been extracted and plotted for gate delay, dynamic power and leakage current and is presented below in Figure 12, Figure 13 and Figure 14.

(http://www.em.avnet.com/ctf_shared/sta/df2df2usa/ASIC‐services‐.pdf )

The brief includes processes down to 45nm but the plots, where possible, utilise trend lines to project performance down to 22nm technology.

Figure 12 IBM ASIC Gate Delays

The gate delay represents the latency through individual gates and coupled with propagation delay determines how fast sequential logic could theoretically run on the device. In reality, the speed is likely to be governed by the achievable thermal density of the device.

The thermal dissipation internal to the device can be considered to be made up of two components:

 Dynamic

 Static

The dynamic power is the work done in switching the internal transistors in relation to the internal resistances and parasitic capacitances within the device. Chandrakasan and Brodersen (1996) have shown the dynamic power to be:

P_dynamic = ½ × α × C_L × V_DD² × f

Where C_L is the capacitive load, V_DD the supply voltage, f the clock frequency and α a variable with a value between 0.05 and 0.5, dependent on the type of circuit.

Internal to the device, scaling the technology reduces the C_L and V_DD terms, resulting in a reduction of dynamic power. Figure 13 shows the dynamic power in Watts per MHz per gate as a function of process size. For an IBM 65 nm device this is 4.5 nW/MHz/gate, with such devices providing up to 120 million gates. It is estimated a 22nm device will dissipate 2.4 nW/MHz/gate and provide up to 1000 million gates.

Figure 13 IBM ASIC Dynamic Power

The scaling of technology has provided the impetus for many product evolutions but is beginning to become problematic, as it has also scaled the thickness of the oxide layer used to insulate the gate from the semiconductor in the CMOS process. This reduction in thickness has increased the leakage current, which represents the static dissipation of the device, to the extent that static dissipation was becoming the most significant dissipation of the device. Figure 14 illustrates the increase in leakage current per unit gate length as a function of process size.

Figure 14 IBM ASIC Static Power

Recent advances in the material used for the gate insulation have resulted in significant improvements in the leakage current. However, it is difficult to project how the material will improve for future process generations. An article, “Leakage Current: Moore’s Law Meets Static Power”, published in IEEE Computer (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1250885&tag=1), provides an excellent analysis of the subject, including a speculative roadmap of thermal dissipation that is shown in Figure 15.

Figure 15 Total chip dynamic and static power dissipation trends

2011‐02‐27 Page 34 of 71

WP2‐040.030.011‐TD‐001 Revision : 1

This figure is based on the International Technology Roadmap for Semiconductors. The two static power plots represent the 2002 ITRS projections normalized to those for 2001. The dynamic power increase assumes a doubling of on‐chip devices every two years.

3.4.2 Masking Costs

Masking costs refer to the generation of the masks used as part of the photo lithographic process in generating each layer of the ASIC. These tend to increase with smaller feature size. Typical mask set costs are shown in Figure 16.

Figure 16 Mask Tooling Costs

These costs are approximate and will depend on the number of metal layers used and whether double poly or high resistance layers are used as part of the process. Historic data suggest that the cost of masks does not reduce with time.

From the curve, it can be seen there is a significant jump between 0.35 µm and 0.25 µm. This corresponds to the increase in tooling costs as the limits of the technology (2007) are reached.

Tool costs also vary with feature size. For comparatively large feature sizes, electronic design automation (EDA) tools would be ~ $50,000, whereas state of the art feature sizes require more sophisticated tools capable of more detailed modelling, costing several million dollars.

3.4.3 Yield and Die Costs

Manufacturing of an ASIC is achieved by producing multiple dies on a single wafer of silicon in the same way other integrated circuits, such as microprocessors, are produced. Due to defects in the wafer or lithography process not all dies will function. The yield is highly dependent on the maturity of the process and the area of the die, which in turn affects the cost of the individual ASIC. The following equations detail the cost estimation:

Die cost = Wafer cost / (Dies per wafer × Die yield)

Dies per wafer = π × (Wafer diameter / 2)² / Die area − π × Wafer diameter / √(2 × Die area)

The typical defects per unit area are of the order of 0.4/cm², though this depends on the maturity of the process. This leads to the empirical relationship for the die yield:

Die yield = Wafer yield × (1 + (Defects per unit area × Die area) / α)^(−α)

Where α is a parameter that is a measure of manufacturing complexity and corresponds to the number of critical masking levels. Typically the value of α is 4.0 for a multilevel CMOS process.

The wafer yield can be assumed to be nominally 100% as very few wafers are completely unusable.

Looking at some typical figures:

 In quantity, an eight inch wafer costs of the order of $2000 and six inch wafers ~ $1000. Small batches may cost several times this.

 A 4mm sided die gives a yield higher than 90% and provides over 1500 good parts from an eight inch wafer, resulting in a die cost of ~$1.3 (see the sketch after this list). The actual cost will be higher than this to take into account bonding pads, electrostatic protection devices, space between die for saw lines and power distribution. The core area of the die might only occupy 65% of the total space on the wafer, though small designs are less efficient.

 The NRE for the production of a wafer includes more than the mask cost as there is likely to be a data preparation charge ~ $1000. This allows for data preparation including process control monitors on the silicon. In addition, a design rule check by the fabrication plant may cost a few thousand dollars.

 Typical production packaging for a device is of the order of 1 cent per pin. There is also a set up fee for the printing of details on the package such as part and batch numbers.

 The test house must be provided with a simulation file. Initial set up is ~ $10,000. Tests cost ~ 3 – 10 cents per second, which can be an appreciable fraction of the device cost.
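A sketch of the die cost model described above, using the typical figures from the list. The model follows the wafer-cost / dies-per-wafer / die-yield formulation reconstructed above; the pin count and test time used for the packaged cost are assumptions taken from the ranges quoted, and 200 mm is used as an approximation for an eight inch wafer.

```python
import math

def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> float:
    """Gross dies per circular wafer of square dies, including the edge-loss term."""
    r = wafer_diameter_mm / 2.0
    return (math.pi * r ** 2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def die_yield(defects_per_cm2: float, die_area_mm2: float, alpha: float = 4.0) -> float:
    """Empirical yield model: (1 + D0 * A / alpha)^-alpha, wafer yield taken as ~100%."""
    die_area_cm2 = die_area_mm2 / 100.0
    return (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

# Typical figures from the list above: ~200 mm (8 inch) wafer at ~$2000,
# 4 mm x 4 mm die, 0.4 defects/cm^2, alpha = 4.
gross = dies_per_wafer(200.0, 16.0)
y = die_yield(0.4, 16.0)
good = gross * y
die_cost = 2000.0 / good
print(f"gross {gross:.0f}, yield {y:.2f}, good dies {good:.0f}, die cost ${die_cost:.2f}")

# Indicative packaged and tested unit cost (pin count and test time are assumptions).
pins, pin_cost = 2000, 0.01                 # ~1 cent/pin, high pin-count BGA
test_seconds, test_cost_per_s = 7.5, 0.05   # 5-10 s at 3-10 cents/s
unit_cost = die_cost + pins * pin_cost + test_seconds * test_cost_per_s
print(f"indicative packaged and tested cost ~${unit_cost:.2f}")
```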


Minimum production run for a fabrication plant is a boat (25 wafers) so it is advisable to keep production runs to multiples of 25 wafers.

Wafer pricing can vary by a factor of two over the range of 25 to 500 wafers per month.

Typically the time to the first chip will take 12 to 18 months after the start of a new design and subsequent revisions within 1 to 2 months.

3.4.4 Prototyping

Multi‐project wafer (MPW): designs from multiple customers shared on one mask set with the mask costs shared. Long lead time and small silicon area. Cost ~ $5k ‐ $60k depending on process. MPW available through prototyping services such as MOSIS and Europractice in the United States and Europe respectively. The current top end process capability from these services is 65nm. It is estimated that 22 nm will be available via these prototyping services by 2016.

Multilayer Mask (MLM): four mask layers can be accommodated on a single mask. The cost is lower than a full mask set and the turnaround is quicker than MPW.

Dealing with a fabrication or prototyping service can be problematic as there is an expectation that customers are familiar with the fabrication design rules and processes. In particular, submitting a job to MOSIS is via web based forms. Consequently, the use of an intermediate design house is often useful. The EVLA project, for example, has built up a successful working relationship with the design house iSine ASIC services in Boston:

http://www.isine.com/

3.5 Gap between FPGAs and ASICS

ASIC implementation has always provided a more efficient implementation than FPGAs in the context of silicon area used, speed and power consumption. However, FPGAs offer more flexibility and potentially faster and cheaper development through their re‐configurability. Consequently, it is worthwhile looking at the capability gap between the two technologies. Table 7 provides the summary of Kuon and Rose’s analysis presented in the book “Quantifying and Exploring the Gap between FPGAs and ASICs” (2009).

Metric Logic only Logic & DSP Logic & memory Logic, DSP & memory

Area 35 25

Performance 3.4 – 4.6 3.4 ‐4.6 3.5 – 4.8 3.0 – 4.1

Dynamic power 14 12 14 7.1

Table 7 FPGA to ASIC Gap Summary

The figures for this table are derived by a systematic analysis of many commonly used functions using logic, memory and DSP capability to provide implementation details on area, performance, dynamic power and static power. The data for each of these common functions is then averaged to provide the figures presented above. A figure representing the effective gap is then derived:

Effective gap = Area gap x performance gap

For logic only implementations this is 35 × 3.4 = 119 and for logic plus DSP 25 × 3.4 = 85.

It is the size of this gap that prevents FPGAs being used in cost‐sensitive markets with high performance requirements.

If a full custom ASIC is considered (as opposed to the standard cell ASICs considered so far) then the FPGA is potentially ~ 500 times larger, 10 times slower and 42 times more power hungry.

3.5.1 Theoretical Processing Performance

Based on information presented in the multipliers and dividers section of Douglas J Smith’s HDL Chip Design book, it is estimated that a reasonably optimised 4 bit multiplier accumulator can be constructed from ~ 500 gates.

For 65nm technology of the order of 400,000 gates can be implemented per square millimetre of silicon, which corresponds to 800 off 4 bit integer multiply accumulators. Using 22nm technology this increases to 6400 multiplier accumulators per square millimetre.

The processing power of these multipliers will depend on how fast they can be clocked which in turn will determine the thermal dissipation for the device.

Taking a 4 mm x 4 mm die using 22 nm technology, an ASIC will provide a processing capability based on 6400 multiplier accumulators per mm^2.

Taking a reasonable but arbitrary clock rate for the ASIC of 400 MHz and assuming a 16 mm^2 area for the multipliers provides the following performance:

Processing rate = 400 MHz x 16 mm^2 x 6400 MAC/mm^2 ≈ 40 Tera MAC/s
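As a cross‐check, the same back‐of‐envelope figure can be reproduced in a few lines; the MAC density, die area and clock rate are simply the values assumed above.

```python
# Back-of-envelope ASIC multiply-accumulate throughput using the figures above.
macs_per_mm2 = 6400      # 4-bit multiplier accumulators per mm^2 at 22 nm (500 gates each)
die_area_mm2 = 16        # 4 mm x 4 mm die
clock_hz     = 400e6     # reasonable but arbitrary clock rate

n_macs     = macs_per_mm2 * die_area_mm2     # ~102,400 MACs on the die
throughput = n_macs * clock_hz               # multiply-accumulate operations per second
print(f"{n_macs} MACs -> {throughput / 1e12:.0f} Tera MAC/s")   # ~41 Tera MAC/s
```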

3.5.2 Cost

Section 3.4.3 provides details of the top level cost model for ASIC production, showing that the die cost is of the order of $1.20 per device.

The packaging is more expensive at ~1 cent per pin. It is expected that the number of pins will be high to deal with the high bandwidths of data that need to flow through the device. Typically, each pin is probably limited to signals of less than 10 GHz and will require at least one or possibly two associated ground pins to maintain signal integrity. Manufacturers' top end ball grid array packaging can provide up to ~2000 pins, which equates to a packaging cost of $20. MCM packaging may offer a cheaper alternative.

The amount of testing required for each device and expected test yield are not yet known. Due to the likelihood of the inclusion of memory in the device the test time is estimated to be quite high


and is guesstimated at 5 to 10 seconds. This would put the cost of testing each 22 nm device in the region of $0.50.
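Pulling the cost elements of this section together gives a rough per‐device cost roll‐up; all inputs are the order‐of‐magnitude estimates given above, not quotations.

```python
# Indicative cost roll-up for a packaged and tested 22 nm ASIC.
die_cost       = 1.20           # $ per die (Section 3.4.3)
pins           = 2000           # top-end ball grid array package
packaging_cost = 0.01 * pins    # ~1 cent per pin -> $20
test_cost      = 0.50           # 5 to 10 s of tester time, guesstimated at ~$0.50

unit_cost = die_cost + packaging_cost + test_cost
print(f"~${unit_cost:.2f} per packaged and tested device")   # ~$21.70
```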

3.5.3 Thermal Dissipation

A first estimate for the thermal dissipation of the ASIC can be determined from the dynamic power characteristic, which is nominally 2.4 nW/MHz/gate for a 22 nm device, though this does not include the interconnectivity between gates:

P = clock rate (MHz) x number of gates toggling x 2.4 nW/MHz/gate

The number of gates switching during any one multiply is dependent on the characteristics of the input signals. As these are Gaussian, it is a fair assumption that less than half the data bits will be toggling.

P = 400 MHz x (6400 MAC/mm^2 x 500 gates x 16 mm^2) x 2.4 x 10^-9 W/MHz/gate x 1/2 ≈ 25 W
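The dissipation estimate above can be reproduced as follows; the toggle fraction of one half is the assumption for Gaussian input data made in the text.

```python
# Dynamic power estimate for the 22 nm ASIC example (interconnect excluded).
macs_per_mm2    = 6400
gates_per_mac   = 500
die_area_mm2    = 16
clock_mhz       = 400
w_per_mhz_gate  = 2.4e-9        # 2.4 nW/MHz/gate for a 22 nm device
toggle_fraction = 0.5           # Gaussian inputs: assume less than half the bits toggle

gates   = macs_per_mm2 * gates_per_mac * die_area_mm2          # ~51.2 million gates
power_w = gates * clock_mhz * w_per_mhz_gate * toggle_fraction
print(f"~{power_w:.0f} W dynamic dissipation")                 # ~25 W
```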

3.6 Network on Chip, NoC

Network‐on‐Chip, NoC is an approach to designing the communication subsystem between blocks within the same silicon chip by applying networking theory and methods. This provides notable improvements over conventional bus and crossbar interconnections with respect to scalability and power efficiency. A key aspect of implementing a network on chip is the ability to support a high level of modularity that facilitates scalability. For example, processing cores, memory and I/O modules can be replicated within a design with the NoC providing the communication infrastructure.

Typically a NoC design will utilise mesochronous communication which means the communication nodes within the network will utilise clocks running at the same frequency but unknown phases. The phase differences are due to asymmetric clock tree design and differences in load capacitance of leaf cells. To avoid meta‐stability issues, synchronisers are used between clock domains and may be implemented using delay line or pipeline synchronisers [19], [20]

The NoC is then used to provide the communication fabric between processor cores, memory and input/output blocks. A typical implementation is the Intel 80‐core research processor described in [21]. This paper describes how 80 processor cores arranged in a 10 x 8 2D matrix are coupled via a NoC as illustrated in Figure 17. A fully non‐blocking crossbar switch is associated with each processing engine. These switches are then interconnected in a 2D mesh topology with each cross connect made up of two logical communication paths.


Figure 17 Example NoC and processing Tile

Each communication path implements its own flow control, arbitration and queuing. The resultant architecture translates to a regular matrix on silicon, as illustrated in Figure 18.

Figure 18 Silicon Implementation

An individual switching unit translates to roughly 0.34 mm^2 of silicon area for the dual 36 bit transmission on 65 nm silicon, which corresponds to roughly 140,000 gates. The estimated thermal dissipation associated with distributing the global mesochronous clock is 2.2 Watts, assuming the clock is running at 4 GHz off a 1.2 V supply rail. The network achieves a bisection bandwidth of 256 G Bytes/s.


Low‐Power NoC for High‐Performance SoC Design [21] provides an analysis methodology to evaluate multiple flat and hierarchical topologies. The methodology utilises an energy efficiency metric, Epkt, representing the average packet traversal energy as a function of the number of switching hops, links and destination buffering, where:

Havg and Lavg are average hop counts and average distance between switch nodes.

SSavg is the number of I/O ports per switch

Equeue, Earb and Elink are the energies expended in a single communication for the queuing, arbitration and transmission energy for the link.

Example values for packet transmission energies are provided in [19] and given in Table 8. This table also provides values projected to 65 nm and 22 nm technology. In this case the transmitted packet is assumed to consist of 32 bit data and 32 bit address plus a 16 bit header field, up‐sampled to 1.6 GHz for serial transmission. For 65 nm and 22 nm the link speed could potentially be 4 and 8 times faster respectively.

Symbol      Description               Energy (J) per packet    Projected value    Projected value
                                      traversal, 180 nm        65 nm              22 nm

Equeue      Buffer (write/read)       1.97 x 10^-10            4.4 x 10^-11       2.5 x 10^-11

ESF         Switching fabric / port   6.25 x 10^-12            1.4 x 10^-12       7.8 x 10^-13

Emux        2:1 multiplexer           3.04 x 10^-12            6.8 x 10^-13       3.8 x 10^-13

Earb        Arbitration / port        1.79 x 10^-13            4.0 x 10^-14       2.2 x 10^-14

Elink       1-mm link                 4.38 x 10^-11            9.9 x 10^-12       5.5 x 10^-12

Elink_PtP   1-mm link (P to P)        8.76 x 10^-11            2.0 x 10^-11       1.1 x 10^-11

Table 8 NoC Packet transmission Energies
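To give a feel for the magnitudes involved, the sketch below combines the projected 22 nm values from Table 8 into an average packet traversal energy. The hop count, node spacing and the way the terms are combined are illustrative assumptions only; the exact Epkt expression is the one given in [19].

```python
# Illustrative packet traversal energy from the projected 22 nm values in Table 8.
# The composition (per-hop queue + arbitration energy plus per-mm link energy)
# and the H_avg / L_avg values are assumptions for illustration, not taken from [19].
E_queue = 2.5e-11    # J, buffer write/read
E_arb   = 2.2e-14    # J, arbitration per port
E_link  = 5.5e-12    # J, per 1-mm link

H_avg = 5            # assumed average hop count for a 10 x 8 mesh
L_avg = 2.0          # assumed average switch-to-switch spacing, mm

E_pkt = H_avg * (E_queue + E_arb) + H_avg * L_avg * E_link
print(f"Epkt ~ {E_pkt * 1e12:.0f} pJ per packet")    # of the order of 180 pJ
```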

Research has been done on integrated optical waveguides and devices comprising an Optical Network‐on‐Chip (ONoC). IBM have a concept for a three dimensional silicon processing chip that


will include an in‐built photonic network layer that provides optical routing between processing cores and memory blocks, as illustrated in Figure 19. Further details are available at:

http://domino.research.ibm.com/comm/research_projects.nsf/pages/photonics.index.html

Figure 19 Artist's concept of 3D silicon processor chip with optical IO layer featuring on‐chip nanophotonic network

4 Storage

Storage is a major technology driver for the signal processing aspects of the SKA telescope as the buffering and time alignment of high bandwidth data streams requires a significant amount of storage capability. The most notable use of large amounts of memory is associated with the delay compensation buffer and the output buffers for correlation products. However, memory is used throughout the signal processing subsystem, from embedded registers in any processing engine through to memory buffers used for implementing corner turns of data.

The term storage has been used to cover the different classes of storage that are available or are being developed within the commercial sector. Figure 20 provides a representative taxonomy of the storage types that are likely to be available within the time frame of SKA (2015/ 2016). This diagram isn’t a complete set but hopefully details the main contenders.


(Figure 20 block definition diagram: Storage divides into Volatile Memory (SRAM, DRAM), Non‐Volatile Memory (NAND Flash, NOR Flash and Storage Class Memory, the latter comprising MRAM, RRAM and PRAM) and Bulk Storage (Tape, Hard Disk and Semiconductor Disk).)

Figure 20 Storage taxonomy

In general, storage technologies have developed to support the hierarchical model utilised for general purpose computing as detailed in Figure 21. The hierarchy has limited amounts of fast storage close to the processor at the top of the hierarchy and large quantities of slower but cheaper storage at the bottom. The figure details relative performance in terms of the number of CPU cycles taken to access data.

A long recognised problem with the storage hierarchy of Figure 21 is the bandwidth performance gap of over a factor of 100 between DRAM technology and Hard Disk. Solid State Drives are beginning to address this gap and may at some stage replace Hard disk technology completely. However, within the time frame of SKA1, it is believed that Hard disk technology will still be the technology of choice for the high end market.

Although tape storage is included in the diagrams, it is not directly applicable to signal processing and as such isn't covered any further within this document. However, it is a technology likely to be applicable to the Science Computing domain. Hard disks, Solid State Drives and their hybrids are explored as they may be of some utility in the Non Image Computing domain, where traditionally raw "voltage" data is captured and analysed off line, though for the SKA the volume of data involved and the cost of storage are likely to severely limit the export of raw voltage data products.


(Figure 21 shows the hierarchy running, from top to bottom: SRAM, DRAM, SCM, Hard Disk.)

Figure 21 Storage Hierarchy

Currently Samsung are the market leaders in semiconductor memory and in particular Flash memory. Their historic roadmap of memory device bit density is presented in Figure 22.

Figure 22 Samsung’s Memory Technology and Solutions Roadmap

http://www.samsung.com/us/aboutsamsung/ir/ireventpresentations/analystday/downloads/analyst_20051104_0800.pdf

This shows how NAND Flash has become the dominant semiconductor technology for bulk storage having passed DRAM in storage density in 2002. This capability has manifested itself in the proliferation of storage devices such as memory sticks and more recently SSD drives.


Table 9 Current Baseline and Prototypical Memory Technologies (ITRS 2007)1

The following sections take a closer look at the characteristics and roadmap of the storage technologies in order of decreasing bandwidth performance.

4.1 SRAM

SRAM is the highest performance storage technology detailed in the storage hierarchy of Figure 21. The price that is paid for this high performance is comparatively high thermal dissipation. Additionally, an individual memory cell is large, typically requiring 6 to 8 transistors per bit of storage. Most SRAM cells have a silicon area in the range of 140 to 150 F^2 (where F is the smallest lithographic dimension). The SRAM cell is a bi‐stable latch and requires power to be maintained in order for the cell contents to remain valid. In addition, SRAM cells are subject to radiation‐induced failures that affect their soft error rate (SER), and must be carefully designed with additional ECC bits and a layout that ensures that an SER event does not affect multiple bits in the same data word.

1 2009 ITRS figures are now available; this table is to be updated at the next issue.


SRAM cells may be designed for low power or for high performance. The memories used in CPU caches obviously take the latter approach and are thus substantial consumers of power. Approaches such as segmenting the power and reducing voltages to the portions of the array not being addressed can be used to help mitigate SRAM power consumption.

Hewlett Packard has an online tool that allows the selection of technology parameters such as feature size and packaging options that provides estimates of thermal dissipation:

http://quid.hpl.hp.com:9081/cacti/sram.y . This tool is also applicable to DRAM technology.

On the whole SRAM will be implemented either on ASIC or as an external COTS chip. The characterisation of the performance, thermal dissipation and cost presented below is for the latter. The Cypress CY7C1069AV33 device has been used as an arbitrary representation of the technology that also has traceability to a costing website.

4.1.1 SRAM performance

The amount of SRAM available on commercial memory chips is limited due to the size of individual bit cells. For the Cypress SRAM memory chip range 16 M bits is the largest device. The CY7C1069AV33 is organised as 2M x 8 bit and has a speed of 10ns.

4.1.2 SRAM Thermal Dissipation

The data sheet for the CY7C1069AV33 claims an active dissipation of under 990mW based on 150nm technology. Assuming that this corresponds to a maximum continuous read or write cycle time of 10ns, 8 bits can be read or written at a rate of 100MHz for this dissipation. This corresponds to the order of 800 M bit/s/W.
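The 800 M bit/s/W figure follows directly from the datasheet numbers quoted above:

```python
# Energy efficiency implied by the CY7C1069AV33 datasheet figures.
active_power_w = 0.99       # < 990 mW active dissipation
word_bits      = 8          # 2M x 8 organisation
cycle_time_s   = 10e-9      # 10 ns read/write cycle

throughput_bps = word_bits / cycle_time_s                # 800 Mbit/s
print(f"~{throughput_bps / active_power_w / 1e6:.0f} M bit/s per Watt")  # ~808, i.e. of the order of 800
```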

4.1.3 SRAM Cost

On line pricing for the CY7C1069AV33 and other devices is available at:

http://www.cypress.com/?id=87&addcols=¶metric=html&filter_184=2Mb+x+8#parametric

The one‐off costs extracted from this site in February 2011 are:

 $44 for commercial grade 16384 k bits, which is equivalent to $2.7 per M bit

 $67 for industrial grade, equivalent to $4.1 per M bit

4.2 Dynamic Random Access Memory, DRAM

Double Data Rate, DDR, has become the main standard for DRAM implementation and is currently in its third generation DDR3. Figure 23 and Figure 24 provide a historic perspective of the DDR generations and their production.


Figure 23 Samsung’s DRAM Historic Roadmap

Figure 24 Samsung’s DRAM Historic Roadmap

From these curves it can be seen:

 Each technology generation has a three year period

 Each generation doubles the storage capacity per device

 The number of units being shipped is increasing

4.2.1 DRAM Performance

Samsung have published their proposed roadmap for DDR4 DRAM which is likely to reach peak production in 2015. Part of the roadmap includes a historic roadmap of bandwidth performance for DDR technology up to and including DDR4 in 2012 which is provided in Figure 25


Figure 25 Samsung DDR DRAM Performance Roadmap

http://www.samsung.com/global/business/semiconductor/Greenmemory/Products/DDR3/DDR3_Overview.html

It is expected that DDR4 will initially clock at 2.133 GHz (~32 GB/s for 16 bit data) and it will scale up to 4.2 GHz by 2015.

In January 2011 Samsung announced the completion of the development of the world’s first DDR4 product:

http://www.samsung.com/global/business/semiconductor/newsView.do?news_id=1228

4.2.2 DRAM Cost

According to chip market watcher iSuppli, the recent history of the selling price of 1 GB DRAM is shown in Figure 26. An interesting aspect of this curve is the fact that DRAM prices slumped over the 12 months leading up to December 2010. This is likely to have been driven by market forces relating to improved production yield versus market demand. It should be pointed out that these figures only relate to 1 and 2 GB devices and as such do not present a complete picture of the DRAM market. For example, servers typically use 4 GB DDR3 modules, or possibly 8 GB and 16 GB DDR3 modules for top end applications.


Figure 26 DRAM Chip Selling Price December 2010

http://www.isuppli.com/Memory‐and‐Storage/News/Pages/DRAM‐Pricing‐Collapse‐Continues‐in‐December.aspx

Detailed contract and spot DRAM prices and history as well as market intelligence are available at http://www.dramexchange.com/. Prices as of Jan 10th 2011:

 4GB 1066 MHz SO‐DIM DDR3 device is listed at $34 suggesting a price of $8.5 per GB

 2GB 1066 MHz SO‐DIM DDR3 device is listed at $34 suggesting a price of $17 per GB

Ignoring market fluctuations, it is assumed that DDR DRAM prices will nominally halve with each new generation of technology over a three year time span. This would suggest DRAM will be of the order of $2 to $4 per GB in 2015/16.
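The projection can be written as a simple geometric decline; the three year halving period is the assumption stated above.

```python
# DRAM price projection, assuming the $/GB halves with each three-year generation.
price_2011 = 8.5            # $/GB for the 4 GB DDR3 module quoted above
years      = 4.5            # early 2011 to mid 2015/16

projected = price_2011 * 0.5 ** (years / 3.0)
print(f"~${projected:.0f} per GB in 2015/16")   # ~$3/GB, within the $2-$4 range quoted
```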

4.2.3 DRAM Thermal Dissipation

Figures are available for the measured thermal dissipation of Samsung's DRAM in a server environment and are presented in diagrammatic form in Figure 27.

Figure 27 Samsung DRAM: Measured Thermal Dissipation


http://www.samsung.com/global/business/semiconductor/Greenmemory/Applications/ServerStorage/ServerStorage_DDR3.html

Taking the 30nm‐class 4Gb 1.35V DDR3 technology implies that the thermal dissipation is 350mW per Giga‐bit.

The clock rate for DDR4 is likely to be of the order of 4.2 GHz by 2015; however, the operating voltage is only likely to drop to 1.05 V. Consequently, although DDR4 dissipation per package may rise, the thermal dissipation per Giga‐bit is expected to fall to around 260 mW. A brief discussion of the estimated dissipation is available at:

http://www.bit‐tech.net/hardware/memory/2010/08/26/ddr4‐what‐we‐can‐expect/1

4.3 Flash Memory

Flash memory is an electrically erasable, non‐volatile, semiconductor memory that has seen considerable advances in recent years due to its development for the high volume consumer market, including mobile phones, cameras and portable MP3 players. Flash memory is the pre‐cursor to Storage Class Memory, SCM, which may eventually replace the Hard disk completely by offering lower power and improved bandwidth storage.

Currently there are two types of flash technology implementations providing memory by the storage of charge: NAND and NOR. The storage cell structure and schematic for these are illustrated in Figure 28

Figure 28 NAND and NOR Flash Memory Schematics and Cell layout


The differences in connectivity and architecture result in differing performance characteristics. The NAND cell is organised in small blocks with a single bit line (BL in the diagram) feeding each block. Data bits are passed across storage elements to their appropriate location in a serial shift. This results in a high packing density of typically 2 F^2 per bit (where F is the smallest lithographic dimension). New multi‐level cell architectures promise to offer 4 bits of storage in the same space. Although the block architecture offers high storage density, there is a price to pay with respect to achievable read and write data rates: currently of the order of 25 MB/s read and 8 MB/s write. Access times are currently of the order of 25 µs, though data blocks can be read faster.

The NOR architecture has each bit connected to its respective bit line, resulting in a larger cell size of the order of 10 F^2 and substantially higher read data rates than are achievable with NAND, with a capability of up to 100 MB/s. However, write and erase performance is limited (due to the mechanism used to store the charge) to less than 0.5 MB/s. As a consequence, NOR implementations tend to be limited to storage for program code.

The concept of charge storage for memory has some implications with respect to the future roadmap of Flash. As feature sizes shrink to accommodate greater storage densities and improved dissipation, the amount of charge that can be stored within an individual cell also shrinks and the potential for charge leakage increases. This imposes limits on scaling as the oxide layer in the transistor needs to be greater than 7nm to ensure data retention. However, the industry consensus is that Flash can scale to at least 22nm. Figure 29 shows the historic road map from Intel and Micron showing the feature size as a function of time up to Q4 of 2009 and illustrates how close Flash technology is to reaching the 22nm feature size.

Figure 29 Intel Micron Historic Flash Roadmap


4.3.1 NAND Cost

Figure 30 NAND Cost per M Byte Road Map

4.3.2 NAND Thermal Dissipation

Void

4.4 Storage Class Memory

Storage Class Memory is a term that applies to a range of memory technologies in development with the ultimate aim of replacing Hard Disks with cheap solid state non‐volatile RAM.

Currently, there are several technologies in development with the aim of satisfying these goals including but not limited to:

 Ferroelectric

 Magnetic

 Phase‐change

 Resistive RAM

 Organic

 Polymeric


Of these technologies, Phase‐change and Resistive memory are currently regarded as offering the most promise of providing a solution for storage class memory. The following papers provide an excellent perspective of the issues that need to be resolved for each candidate technology type and an overview of the strategy of moving towards a SCM solution:

 Storage‐class memory: The Next Storage System Technology, R. F. Freitas and W. W. Wilcke

 Overview of Candidate Device Technologies for Storage‐Class Memory: G. W. Burr, B. N. Kurdi, J. C. Scott, C. H. Lam, K. Gopalakrishnan and R. S. Shenoy

4.4.1 SCM Performance

Performance goals have been identified for a SCM device if it is to offer a viable alternative to Hard disk and possibly DRAM technology:

 Capacity > 1 TB

 Rd/Wr access time < 100 ns

 Bandwidth > 1 G By/s

 Transaction rate > 238,000 transactions/s

 Number of reads/writes > 10^8 to 10^12 times

The number of reads and writes allows for the possibility of wear levelling techniques. To put this in perspective, Flash can be written to 10^4 to 10^5 times, DRAM 10^15 times, and Hard disk 10^12 times before encountering a degradation of data storage reliability.


4.4.2 SCM Cost

Figure 31 SCM Roadmap in relation to NAND, DRAM and Hard Disk (HDD)

http://www.gsaglobal.org/events/2010/0316/docs/7.GMC‐PierreFazan.pdf

Figure 31 shows the projected price per Giga Byte of SCM as a function of time. This suggests that the first SCM technology should be emerging in 2010/11 at 50 cents per Giga Byte, reducing to 4 or 5 cents per Giga Byte by 2015/16.

4.4.3 SCM Thermal Dissipation

Void

5 Disk storage

Hard disks are at the bottom of the storage hierarchy detailed in Figure 21. They offer a high storage density but suffer from some fundamental problems that limit their performance in terms of data rate and access time. Figure 21 shows that disk access is of the order of 10^7 to 10^8 times slower than SRAM. Sophisticated caching schemes have been designed to cover up this difference in performance. The storage density of a disk is known as the areal density and is defined by the physical characteristics of the disk.


Figure 32 Historic Roadmap for Disk Areal Density

Figure 32 details how the areal density of disks has developed over their history from the mid 1950s. This growth is impressive and has resulted in 3 T Byte disks being commercially available by 2010. The slope of this graph varies with time, with the average growth rate ranging from 25% per annum in the 1980s to 60 – 100% per annum through the 1990s.

5.1.1 Disk Performance

Figure 33 Historic Roadmap for Disk Bandwidth

Figure 33 shows the historic bandwidth performance for disks. During the 1990s, bandwidth was improving at a rate of 40% per annum but by 2002 fundamental limitations of disk storage began to surface. The internal data rate of the disk is limited by how quickly the rotating platter of the disk can pass by the read or write heads. The maximum Internal Data rate, IDR, in M By/s is given by

IDR = ( ntz0 x 512 x rpm / 60 ) / ( 1024 x 1024 )


Where ntz0 is the number of sectors per track at the outer edge of the disk (zone 0) and rpm is the number of rotations per minute of the disk. Tracks are grouped into zones based on their distance from the centre of the disk, and each zone is assigned a number of sectors per track. The number of sectors per track increases from the centre outwards and allows for more efficient use of the larger tracks on the outside of the disk. Consequently the highest data rate is achieved in zone 0.
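As a worked example of the IDR expression (the sector count below is a hypothetical figure chosen purely for illustration):

```python
# Maximum internal data rate (zone 0) for a representative high-performance disk.
def idr_mbytes_per_s(ntz0, rpm):
    """ntz0: sectors per track in the outer zone; 512-byte sectors assumed."""
    return ntz0 * 512 * (rpm / 60.0) / (1024 * 1024)

# Hypothetical example: 2000 sectors per track in zone 0 on a 15,000 rpm drive.
print(f"~{idr_mbytes_per_s(2000, 15000):.0f} M Bytes/s")   # ~244 M Bytes/s
```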

High performance disks tend to have a high rpm which for current implementations is 15,000. However, high rotation speeds mean higher mechanical load on the disk bearings which results in higher thermal dissipation. For this reason, high performance disks tend to be limited to a low number of platters and small diameter. This results in a lower disk capacity.

Typically, today, a high performance disk can achieve a sustained bandwidth of the order of 300 M Bytes/s. Projecting to 2015/16 with the annual improvement rate of 40% per annum suggests individual disks might achieve a sustained bandwidth of the order of 2 to 3 G Bytes/s.

5.1.2 Disk Thermal Dissipation

Thermal dissipation and the associated temperature rises are a limiting factor for disk drive performance. For example, a 15 °C rise in ambient temperature can double the failure rate (Anderson, Dykes and Riedel: More Than an Interface – SCSI vs. ATA, Proceedings of the Annual Conference on File and Storage Technology, March 2003).

Hennessy and Patterson provide figures for a typical ATA drive in 2006 as:

 Idle: 9 Watts

 Reading or Writing: 11 Watts

 Seek: 13 Watts

Gurumurthi and Sivasubramaniam: Disk Drive Roadmap from the Thermal Perspective: A Case for Thermal Management, Proceedings of the 32nd International Symposium on Computer Architecture (ISCA '05), provide empirical relationships for thermal dissipation:

P ∝ D^4.6 x rpm^2.8 (where D is the platter diameter)

Unfortunately the paper does not provide the constant of proportionality for this relationship. However, from the relationship it can be seen that the Diameter of the disk platters and the disk speed in rpm have a major impact on thermal dissipation. For this reason, it is expected that improvements in areal density will be traded against disk platter size in the near future.
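Because the constant of proportionality is unknown, the relationship is most useful for comparing designs. The sketch below uses the exponents from the relationship above with hypothetical drive parameters to compare a small fast drive against a larger, slower one.

```python
# Relative thermal scaling implied by the empirical relationship above.
# Only ratios are meaningful since the constant of proportionality is unknown.
def relative_heat(platter_diameter_mm, rpm):
    return platter_diameter_mm ** 4.6 * rpm ** 2.8

# Hypothetical comparison: 2.5-inch (65 mm) 15,000 rpm versus 3.5-inch (95 mm) 7,200 rpm.
ratio = relative_heat(65, 15000) / relative_heat(95, 7200)
print(f"heat ratio ~ {ratio:.1f}")   # ~1.4: smaller platters largely offset the higher rpm
```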

5.1.3 Disk Cost

Edward Walker's (NRAO) paper 'To lease or not to lease from storage clouds' provides a cost model for the price per Giga Byte, GT, of a disk T years in the future which currently costs K dollars per Giga Byte.


In January 2011 the price per Giga Byte of disk storage is of the order of $0.039.

Projecting to 2015/16 this is expected to be in the region of $0.005
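A simple geometric decline, calibrated so that the January 2011 figure of ~$0.039 per GB falls to roughly $0.005 per GB by 2015/16, is sketched below; the annual factor is an assumption made here and is not taken from Walker's paper.

```python
# Disk price-per-GB projection assuming a constant year-on-year decline.
K = 0.039                 # $/GB in January 2011
annual_factor = 0.66      # assumed year-on-year price multiplier (not Walker's figure)

for T in (1, 3, 5):
    print(f"T = {T} years: ${K * annual_factor ** T:.3f} per GB")
# T = 5 gives ~$0.005 per GB, in line with the 2015/16 projection above
```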

6 Network

The network provides the communications infrastructure within the SKA telescope and includes not only the communication paths for the data streams but also the Monitoring and Control communication paths. This document limits the scope of technology to that used within the signal processing domain and does not include the technology for the receptor to central processing and central processing to science computing links. The main implication of this is that the transmission distances considered are likely to be limited to 100 metres or less.

Currently the main contenders with publicised roadmaps for providing a commercial solution are Ethernet and Infiniband.

6.1 Infiniband

The Infiniband Trade Association (IBTA), http://www.infinibandta.org/ , defines InfiniBand as an industry‐standard specification that defines an input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems. InfiniBand is a true fabric architecture that leverages switched, point‐to‐point channels with data transfers today at up to 120 gigabits per second, both in chassis backplane applications as well as through external copper and optical fibre connections.

6.1.1 Infiniband Performance Roadmap

A historic roadmap for Infiniband capability is provided at the IBTA web‐site and is shown in Figure 34.


Figure 34 Infiniband Roadmap

From this roadmap, it can be seen that the Enhanced Data Rate technology is due for release this year (2011). This will offer 20 G bit/s capability per lane with up to 12 lanes per interconnect (240 G bit/s)

High Data Rate, HDR, and Next Data Rate, NDR, technologies are identified for the future. Extrapolating the time line, it might be reasonable to expect HDR capability of 480 + 480 G bits/s by 2014/2015.

6.1.2 Host Channel Adapters

Host channel adapters, HCAs, provide the line card functionality for the hosting computer and provide a bridge between the PCI/PCI Express interface of the computer and Infiniband. The HCA off‐loads the protocol stack from the hosting computer, resulting in low latency communication. As an example product, Mellanox manufacture a dual 4x QSFP 40 Gbit/s HCA (part number: MHQH29C‐XTR) that dissipates 8.8 Watts and retails for $892 unit price:

http://www.mellanox.com/related‐docs/prod_adapter_cards/ConnectX‐2_VPI_Card.pdf

http://www.provantage.com/mellanox‐technologies‐mhqh29c‐xtr~AMLNX17U.htm

6.1.3 Infiniband switches

Several blue chip manufacturers, including but not limited to IBM, Cisco, HP and Oracle, manufacture or utilise third party Infiniband switch products in some of their products. The main manufacturer of Infiniband silicon is Mellanox: http://www.mellanox.com/ . They also manufacture Host Bus Adaptors and Infiniband switches, with the largest (IS 5600 model) supporting up to 640 40 G bit/s ports in a 29U rack with up to 6.7 kWatts dissipation:


http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=59&menu_section=49

Sun Microsystems have what they claim is the world’s largest Infiniband switch at 3456 nodes. http://www.sun.com/products/networking/infiniband.jsp

Sun claim 41 T bit/s bisection bandwidth for less than 6.5 kWatts thermal dissipation, which equates to approximately 0.16 Watts per G bit/s.

6.2 Ethernet

Ethernet is the ubiquitous communication protocol that has evolved and outlived most of its rivals. The latest incarnation is the 100 G bit/s standard IEEE 802.3ba, which was ratified in June 2010. This standard has the following key characteristics:

 Ethernet frames at 40 and 100 gigabits per second over multiple 10 Gb/s or 25 Gb/s lanes

 Preserve the 802.3 / Ethernet frame format utilizing the 802.3 MAC

 Preserve minimum and maximum Frame Size of current 802.3 standard

 Support a bit error ratio (BER) better than or equal to 10^-12 at the MAC/PLS service interface

 Provide appropriate support for OTN

 Support MAC data rates of 40 and 100 Gbit/s

 Provide Physical Layer specifications (PHY) for operation over single‐mode optical fibre (SMF), OM3 multi‐mode optical fibre (MMF), copper cable assembly, and backplane.

Several standards for the physical interface are defined and summarised in Figure 35

Figure 35 Ethernet PHY standards

Of principal interest for signal processing are the short haul interfaces of less than 100 m, in particular the 100GBase‐CR10 with ten lanes of twin‐ax and the 100GBase‐SR10 with ten lanes of short reach multi‐mode fibre. The LR4 and ER4 interfaces are based on 4 x 25 G bit/s lanes.


6.2.1 100 G bit/s Ethernet Switches

At the time of writing (January 2011), the first switch products promising 100 G bit/s line cards are beginning to emerge. Alcatel Lucent already have silicon in the form of their FP2 chip, which will be available as part of their 7450 switches and 7750 routers.

http://www.alcatel‐lucent.com/features/100GE/game_changing_ss.html

6.2.2 Terabit Ethernet

There is already some discussion of Terabit Ethernet capability, and Facebook has identified its need for the technology. A website dedicated to the subject includes video interviews with Bob Metcalfe speculating that terabit Ethernet will be commercially available by 2015.

http://www.terabit‐ethernet.com/

6.2.3 Ethernet Cost

In 2001, when 10 Gigabit Ethernet switches were introduced, the average per‐port cost was $39,000, according to IDC. By January 2009 this had reduced to under $4000. http://www.networkworld.com/supp/2009/outlook/hottech/010509‐nine‐hot‐techs‐10‐gig‐ethernet.html

Today a dual port network interface card costs of the order of $700, suggesting $300 to $400 per port.

Brocade has announced initial pricing for 100 G Ethernet at $100K per port:

http://www.networkworld.com/news/2010/091510‐brocade‐10g‐ethernet.html

6.2.4 Ethernet Thermal Dissipation

Alcatel Lucent have published a power consumption roadmap (see Figure 36) for Ethernet Line cards providing the number of Watts dissipated per Giga‐bit per second transmitted


Figure 36 Alcatel Lucent Power Consumption Roadmap

This graph suggests that 100 G bit/s line cards will dissipate just over 400 Watts at maximum bandwidth. This includes the dissipation for all components on the line card including data exchange memory, interface chips, MAC devices as well as the Ethernet physical device.

The CFP Module standard specifies the physical device as part of the Multi Source Agreement for plug in modules:

http://www.cfp‐msa.org/

As part of the standard a set of thermal dissipations are specified via a hardware interlock mechanism as detailed in Figure 37.

Figure 37 CFP Hardware Specification Power Interlock

The maximum power class suggests that CFP modules will dissipate less than 32W .

6.3 Optical Interconnect

Products are beginning to emerge that are pushing the boundaries of where optical interconnection can be used. It is common practice to provide high bandwidth interconnections between individual racks of equipment. However, the potential to interconnect optically through backplanes and


eventually to individual chips is becoming a reality through the use of optical wave guides and lens arrays. The provision of optical interconnectivity helps mitigate the traditional problems of thermal dissipation and the signal integrity issues associated with electrical connectivity.

The most prominent coverage of optical connection development in the recent electronics press and journals is the development of IBM's Terabus. For this reason this document focuses on IBM technology to provide an overview.

The provisional time line suggested by IBM for the introduction of the technology is:

 Today: rack to rack conventional optical modules and edge of card packaging

 2011: Dense parallel fibre coupled modules close to the CPU

 2015: Integrated transceivers and optical printed circuit boards

 2020: 3‐D stack processing chips with transceivers integrated into processors.

http://www03.ibm.com/procurement/proweb.nsf/objectdocswebview/filepcb+‐+ibm+opcb+roadmap+and+technology+‐+jeff+kash.pdf/$file/ibm+opcb+roadmap+and+tech+‐+jeff+kash.pdf

IBM has already created a prototype printed circuit board that interconnects two modules using a polymer waveguide. The overall concept is illustrated in Figure 38

Figure 38 IBM Terra Bus Overview

The modules use a waveguide lens array that provides a mechanism for interfacing the optical signal into and out of the waveguide. Detail of the optical lens array implementation in relation to the overall module and printed circuit board is shown in Figure 39 and Figure 40 respectively.


Figure 39 IBM Terrabus Integrated Circuit Connectivity

Figure 40 IBM Terrabus Integrated Circuit and Printed Circuit board Optical Connectivity

Within the module, vertical‐cavity surface‐emitting laser (VCSEL) optical transmitters and photo detector devices provide the optical transmit and receive capability respectively. It has been found that at a wavelength of 850 nm there is less loss in the polymer waveguide, and links of up to 1 metre are currently supported. Industry is already producing 850 nm VCSEL devices in high volume, including multi‐channel devices. Emcore have 12 channel devices on their web‐site http://www.emcore.com/fiber_optics/transceivers/12_channel_parallel that offer up to 5 G bps per lane and up to 24 channels. IBM is using a special 24 channel variant of this device with 15 G bit/s capability per lane.

As illustrated in Figure 39, a silicon carrier chip is used to host the components of the module. Currently, IBM's Terabus is using a carrier with dimensions of 10.4 x 6.4 mm. The laser and photo diode arrays plus the CMOS Tx and Rx components are soldered to the carrier.

6.3.1 Performance

Current performance demonstrated by IBM at the SC07 show was 10 Gb/s along a 150 mm bus utilising 32 way links operating at 985 nm. However, subsequent research has shown that 850 nm wavelengths are better suited to transmission through the waveguides, and data rates of 360 Gb/s over links of up to 1 metre have been achieved. The current bandwidth density is 9 G bit/s/mm^2.


6.3.2 Thermal Dissipation

The current thermal dissipation for 24 channels, each operating at 15 G bit/s for a bidirectional rate of 360 G bit/s, is 2.3 W. This equates to a power efficiency of approximately 6.4 pJ per bit.
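The quoted efficiency follows directly from these two numbers:

```python
# Energy per bit implied by the Terabus figures above.
power_w    = 2.3       # module dissipation
bits_per_s = 360e9     # 24 channels x 15 G bit/s

print(f"~{power_w / bits_per_s * 1e12:.1f} pJ per bit")   # ~6.4 pJ/bit
```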

6.3.3 Cost

There are currently no cost estimates available.

7 Appendix 1

7.1 Moore's Law

According to Wikipedia, Moore's law is based on Gordon Moore's paper of 1965, which noted that the number of components in integrated circuits had doubled every year from the invention of the integrated circuit in 1958 until 1965 and predicted that the trend would continue "for at least ten years". His prediction has proved to be uncannily accurate, in part because the law is now used in the semiconductor industry to guide long‐term planning and to set targets for research and development. In general it is accepted that Moore's Law is an observation that the number of transistors within a device doubles every 18 months (see Figure 41), with current (2011) high end devices having of the order of a billion transistors.

Steve Trimberger, Xilinx

Figure 41 Numbers of Transistors for Intel Processors

The ability to include more transistors per chip has the bonus that the cost per transistor is also exponentially decreasing as detailed in the ITRS roadmap (Figure 42)


Figure 42 ITRS transistor cost predictions

Several parameters are closely linked to Moore’s law and the number of transistors per device and are presented in Table 10

Parameter                              Current Value    Yearly Factor    Years to Double (Half)

Moore's Law (grids on a die)**         1B               1.49             1.75

Gate delay                             150 ps           0.87             (5)

Capability (grids / gate delay)                         1.71             1.3

Device-length wire delay                                1.00

Die-length wire delay / gate delay                      1.71             1.3

Pins per package                       750              1.11             7

Aggregate off-chip bandwidth                            1.28             3

From Digital Systems Engineering, Dally and Poulton, 1998

** Ignores multi‐layer metal, 8‐layers in 2001

Table 10 Semiconductor parameter growth

This table shows that the number of transistors per device doubling every 18 months isn't the full story and that some parameters are growing at a considerably slower rate, such as pins per package, which takes 7 years to double, and gate delay, which takes 5 years to halve.
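The "years to double (half)" column of Table 10 follows from the yearly factors; a quick check:

```python
# Convert the yearly growth factors of Table 10 into doubling (or halving) times.
import math

def years_to_double_or_halve(yearly_factor):
    return math.log(2) / abs(math.log(yearly_factor))

for name, factor in [("Moore's Law (grids on a die)", 1.49),
                     ("Gate delay", 0.87),
                     ("Pins per package", 1.11),
                     ("Aggregate off-chip bandwidth", 1.28)]:
    print(f"{name}: {years_to_double_or_halve(factor):.1f} years")
# ~1.7, ~5.0 (halving), ~6.6 and ~2.8 years, consistent with the table
```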


7.2 Transistor Size

The ability to increase the number of transistors on a device relies on the ability to reduce the transistor size and to increase the silicon area. A historical roadmap, including projections to beyond 2020, is shown in Figure 43, which speculates on a 4 nm feature size at the extreme end of the projection.

Michael Keating SNUG 2010

Figure 43 Roadmap of Transistor Size

Taking a scaling factor of α for the feature size shrinkage, the behaviour of other aspects of the device can be derived and is shown in Table 11. How these parameters relate to the physical implementation of the device is illustrated in Figure 44.

Scaling                        Results

Voltage       V/α              Higher density     α^2

Oxide         tox/α            Higher speed       α

Wire width    W/α              Power / ckt        1/α^2

Gate width    L/α              Power density      constant

Diffusion     Xd/α

Substrate     α * NA

Table 11 Device Scaling factors


Figure 44 Physical Scaling of Parameters for a Semi‐conductor gate

The terms in Table 11 are hopefully self explanatory. However, it is worth pointing out that the device density in terms of transistors per unit area increases, and the power per circuit decreases, by a factor proportional to the square of the feature size reduction. A negative aspect of this scaling is that the oxide thickness tox decreases proportionally with the feature size reduction. This means more static leakage current across transistor gates. Improvements in the oxide material, such as the use of hafnium oxide, have improved the situation.
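As a concrete illustration of the Table 11 factors, a minimal sketch for a shrink from 65 nm to 22 nm (α ≈ 3):

```python
# Classical (Dennard-style) scaling from Table 11 applied to a 65 nm -> 22 nm shrink.
alpha = 65 / 22                               # ~2.95 linear shrink

print(f"density     x {alpha ** 2:.1f}")      # ~8.7x more transistors per unit area
print(f"speed       x {alpha:.1f}")           # ~3.0x faster
print(f"power/ckt   / {alpha ** 2:.1f}")      # ~8.7x less power per circuit
```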

7.3 Breaking Moore’s Law

There are a number of issues, which may lead to a breakage of Moore’s Law most notably design complexity:

 Lithography ‐ reduced dimensions make mask production very difficult

 Process technology complexity and maintaining yield

 Length of interconnects on chip, leading to increasing propagation delays and parasitic capacitance

 Reduced gate oxide thickness below 1 nm, leading to fluctuations in doping profiles (a 100 atom long gate length contains fewer than 100 dopant atoms)

As well as technical issues, there are significant economic factors in device production. Effectively a corollary to Moore's Law is Rock's Law, which states that the tooling cost for semiconductor die manufacture doubles every four years. This is far in excess of inflation, which halves the value of money every decade. Consequently, the cost of manufacture may become the limiting factor.

7.4 Moore’s Law and Processing Capability

Moore's Law relates to the number of transistors on a chip, which does not necessarily reflect directly on processing power. Not all transistors within a processor chip are directly associated with processing. Within a RISC processor, in particular, elaborate multi‐level cache, branch look‐ahead and queue mechanisms are implemented, occupying a substantial area of the device real estate.


Prior to the advent of the RISC processor in 1985, processing performance grew by a factor of approximately 1.35 per year. Subsequent processing performance growth has averaged a factor of approximately 1.6 per year. However, this growth has been at the expense of ever‐increasing inefficiencies in the use of the silicon area (and larger numbers of external connections).

As illustrated earlier, the dissipation expected from future processors increases exponentially with respect to time. Reductions in operating voltage only slow the rate of increase in dissipation by of the order of 30%. It should be noted, in passing, that the dissipation of a processor varies with the application it is running, depending on how many transistors, and at what rate, are being switched by that application.

A summary of historic scaling factors:

 The number of transistors per chip has doubled every 18 months for over 40 years

 Feature size has reduced by 30% every 2 to 3 years

 Until recently speed increased by 30% per year

 Function cost has reduced by ~ 25 to 30 % per year

 Scaling is expected to continue until transistor gate lengths are ~ 10nm

Issues that may limit scaling:

 Sub‐threshold current (off‐current) doesn’t scale

 Electron tunnelling increases with small dimensions

 Doping variations cause large threshold voltage variations

 Power density as a result of leakage current increases more rapidly than dynamic power

8 Appendix 2

This appendix identifies other potentially interesting boutique technologies. Documentation on these is deliberately kept to a minimum. On the whole, these technologies are higher risk as, at present, they do not necessarily represent the mainstream and will generally not be second sourced.

8.1 Tilera

Tilera are a California based company, with representation in China, Japan and Korea, that produce a processing chip with 16 to 100 identical processor cores (tiles) interconnected by an on‐chip network. Each tile consists of a processor with L1 and L2 cache plus a non‐blocking switch that connects the tile into the mesh. Each tile can independently run a full operating system, or a group of multiple tiles can run a multi‐processing OS such as SMP Linux. This architecture is now in its third generation, with the TILE‐Gx family offering 100, 64, 36 and 16 core versions as detailed in Figure 45.


Figure 45 Tilera Tile Processor architecture

The initial market for the Tile architecture was aimed at modelling climate conditions. However the company are targeting the following sectors:

 Cloud Computing

 Networking

 Wireless Infrastructure

 Digital Media

(http://www.tilera.com/sites/default/files/productbriefs/PB025_TILE‐Gx_Processor_A_v3.pdf) :

8.2 Clearspeed

Clearspeed are a company based in Oxon, UK, that currently have a low power processor containing 96 processing elements implemented in their CSX700 chip (Figure 46). Further details are available at http://www.clearspeed.com/

The main performance characteristics of the chip include:

 250MHz core clock frequency

 96 GFLOPS single or double precision

 75 GFLOPS sustained double precision DGEMM

 48 GMAC/s integer performance

 9W typical power dissipation


 192 Gbytes/s internal memory bandwidth

 2 x 4 Gbytes/s external memory bandwidth

 4 Gbytes/s chip‐to‐chip bandwidth

Figure 46 Clearspeed’s CSX 700

Clearspeed have shown interest in the SKA project and have indicated that their technology will be available as IP.

8.3 PicoChip

Picochip are a company from Cambridge that are producing processing chips targeting telecommunication base stations.

http://www.picochip.com/

They have a low power, low cost DSP solution based on multiple 16 bit Harvard architecture processing cores implemented in an architecture known as the picoArray, which is illustrated in Figure 47. The claimed cost/performance ratio is below $1/GMAC in volume.

The processor cores include FFT and IFFT capability for up to 1024 points.

Picochip claim:

“Like an FPGA, the picoArray structure is defined at design‐time (not run‐time); tasks are distributed “physically” in space; and deterministic, cycle‐accurate simulations are possible. But, unlike an FPGA, timing closure is not an issue; design and build time is measured in minutes and seconds, not hours; development is in C or assembler; and task granularity is at


the word (or sample) level, so implementation is more efficient and programming is inherently easier.”

Figure 47 PicoChip’s Pico Array Architecture.

8.4 Other Technologies

This section provides hyper‐links to other processing technologies that are of interest but not necessarily directly applicable to the SKA. For example, the Netronome network processor is designed for streamed processing using 40 processor cores with support for 10 G bit Ethernet interfaces, but it targets deep packet inspection rather than signal processing.

 Cavium: real time Deep Packet Inspection processing technology

http://www.caviumnetworks.com/newsevents_Caviumnetworks_Heavy‐Reading‐Report.html

 Aspex Semiconductor: Real time video encoding DSP technology

http://www.aspex‐semi.com/

 Freescale:

http://www.freescale.com/

 Storm Stream processor

http://www.streamprocessors.com/

 Netronome Network Processor:

http://www.netronome.com/
