VYSOKÁ ŠKOLA BÁŇSKÁ – TECHNICKÁ UNIVERZITA OSTRAVA Fakulta elektrotechniky a informatiky

Report for the PAP course

Architektura 2

My Chinh Nguyen Hong Ha Vu

2008

1. Why 64-bit? ...... 3
2. History ...... 3
2.1 Itanium (1) ...... 3
2.2 Itanium 2 ...... 4
3. Overview ...... 4
3.1 Block Diagram ...... 4
3.2 Compare Itanium & Itanium 2 ...... 5
3.3 6-Wide EPIC Core ...... 5
3.4 Software Pipelining ...... 7
4. Execution Core ...... 8
5. Control ...... 8
6. Instruction execution ...... 8
7. Memory subsystem ...... 9
7.1 L1 cache ...... 9
7.2 L2 cache ...... 10
7.3 L3 cache ...... 10
7.4 The Advanced Load Address Table (ALAT) ...... 10
7.5 Translation Lookaside Buffers (TLBs) ...... 10
8. Future processors ...... 11

1. Why 64-bit? 32-bit systems are good enough for most applications, but many businesses are switching to 64-bit computing for several reasons. One of them is the growing number of 64-bit applications, more and more of which demand higher processor speed and faster memory. Combining a 64-bit processor architecture with a larger memory space lets the CPU pull more data into memory and work on it directly, which speeds up not only applications but the operating system as well. 64-bit servers are increasingly used for enterprise applications specialized for 64-bit computation, especially databases with heavy workloads, where throughput benefits from the processor's ability to execute in parallel. Business applications must quickly handle enormous processing jobs, and their speed is strengthened by the volume of data that can be held in the processor's caches and in main memory. Such applications include e-commerce sites with large transaction volumes or data archives, enterprise resource planning, customer relationship management, data mining and business intelligence. Others include scientific and technical computing, human resources, online analytical processing and financial modeling. Industries with these demanding processing and availability requirements, such as finance, securities, insurance, manufacturing, energy, oil and gas, and the life sciences, are the typical candidates, and for them solutions running on Itanium 2 based systems are the most relevant.
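To make the memory argument above concrete, here is a small Python sketch (our illustration, not part of the original report) of how much memory each address width can address directly:

```python
# Illustrative sketch: the maximum directly addressable memory of a flat
# byte-addressable space grows exponentially with the address width.
def addressable_bytes(address_bits: int) -> int:
    return 2 ** address_bits

GIB = 2 ** 30  # gibibyte
EIB = 2 ** 60  # exbibyte

print(addressable_bytes(32) // GIB)  # 4  -> a 32-bit CPU addresses 4 GiB
print(addressable_bytes(64) // EIB)  # 16 -> a 64-bit CPU addresses 16 EiB
```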

To visualize the difference in address space: if the 32-bit space is represented by 1 cm (roughly the height of one CD cover), the 64-bit space corresponds to 429,496 km, more than the distance between the Earth and the Moon.

2.History

2.1 Itanium (1)

Code name: Merced
Shipped in June 2001
180 nm process
733 / 800 MHz
2 MB or 4 MB cache, off-die
The only version of Itanium: the Merced core

Nicknamed "the Itanic": the original Itanium was expensive and slow at executing 32-bit code.

2.2 Itanium 2

Common features: 16 kB L1 I-cache, 16 kB L1 D-cache, 256 kB L2 cache
Revisions: McKinley, Madison, Hondo, Deerfield, Fanwood, Montvale
Upcoming revisions: Tukwila, Poulson

3. Overview

3.1 Block Diagram

Instruction Processing The instruction processing block contains the logic for instruction prefetch, instruction fetch, L1 instruction cache, branch prediction, instruction address generation, instruction buffers, instruction issue, dispersal and rename.

Execution The execution block consists of the multimedia logic, integer ALU execution logic, floating- point (FP) execution logic, integer register file, L1 data cache and FP register file.

Control The control block consists of the exception handler and the pipeline control, as well as the Register Stack Engine (RSE).

Memory Subsystem The memory subsystem contains the unified L2 cache, the on-chip L3 cache, the Programmable Interrupt Controller (PIC), the instruction and data Translation Lookaside Buffers (TLBs), the Advanced Load Address Table (ALAT) and the external system bus interface logic.

IA-32 Compatibility Execution Engine Instructions for IA-32 applications are fetched, decoded and scheduled for execution by the IA-32 compatibility execution engine.

3.2 Compare Itanium & Itanium 2

Feature                       Itanium                          Itanium 2
Integer ALUs                  4                                6
Multimedia ALUs               4                                6
Extended precision FP units   2                                2
Single precision FP units     2                                2
Load or store units           2                                2
Branch units                  3                                3
Stages in 6-wide pipeline     10                               8
L1 cache                      32 kB                            32 kB
L2 cache                      96 kB                            256 kB
L3 cache                      4 MB (off-chip)                  3 MB (on-die), later up to 24 MB
Clock                         800 MHz                          1 GHz initially, up to 1.66 GHz on Montvale
Process                       180 nm                           180 nm, shrunk to 130 nm in 2003 and to 90 nm in 2007
System bus                    2.1 GB/s, 266 MHz, 64-bit wide   6.4 GB/s, 400 MHz, 128-bit wide

3.3 6-Wide EPIC Core

Benefits:
- The compiler has more time to spend analyzing the code
- The time spent by the compiler is a one-time cost
- Reduced circuit complexity

Drawbacks:
- Runtime behavior isn't always obvious in the source code
- Runtime behavior may depend on the input data
- Performance depends greatly on the compiler

EPIC is a combination of speculation, predication and parallel execution that addresses limits which CISC and RISC technology, for all the advances in process technology, could not overcome. While both RISC and CISC architectures use various techniques to try to process more than one instruction at a time, the degree of parallelism in the code is determined at run time, by the part of the processor that analyzes and reorders instructions while the data is being processed. EPIC instead lets the software tell the processor explicitly which operations can run in parallel, without going through the usual hardware machinery. As a result the processor can grab the largest possible group of instructions, execute them back to back, and reduce processing overhead. Performance rises because the number of branches and branch mispredictions drops and the effects of memory latency are reduced.

The IA-64 instruction set architecture was introduced publicly in May 1999. EPIC technology offers vast resources, and it represents the biggest advance in processor architecture since the 386 in 1985. An IA-64 processor has a large register set: 128 integer registers, 128 floating-point registers and 64 one-bit predicate registers used for predication. Instructions are formed into groups that execute in parallel across the different functional units. The instruction set has been optimized to address the needs of cryptography, video decoding and other functions that will matter in the next generation of servers and workstations. MMX technology and the SIMD instruction extensions are maintained and extended in IA-64 processors. IA-64 is explicitly not a 64-bit version of the 32-bit x86 architecture, nor of HP's 64-bit PA-RISC; instead it protects today's applications and software infrastructure by remaining compatible with the older processors' code through a software translation layer.
Everything in this ISA, however, relies on the compiler to find and optimize the instruction-level parallelism, except in the cases where code is translated. Because IA-64 runs 32-bit software only through that translation layer, and without much enthusiasm, AMD proposed its own way to support 64-bit code and memory addressing, under the code name "Sledgehammer", one that would not abandon old software: AMD's Sledgehammer (x86-64) extends the IA-32 architecture itself to support 64-bit code and addresses.

The most important changes are two new characteristics of IA-64:
- Predication, which largely replaces branch prediction by allowing the processor to execute all of a branch's paths simultaneously, in parallel with each other.
- Speculative data loads, which allow an IA-64 processor to fetch data before the program needs it, even hoisting the load above a branch that may never be taken.

Predication is central to IA-64's strategy of removing branches and scheduling instructions in parallel. Normally the compiler turns a branching statement (such as IF-THEN-ELSE) into blocks of machine code laid out sequentially in the instruction stream. Depending on the outcome of the branch, the CPU executes one of these basic blocks and jumps over the others. A conventional CPU tries to predict the outcome and speculatively executes the predicted block, which costs a long time when the prediction is wrong. Basic blocks are also very small, typically two or three instructions, with a branch roughly every six instructions; this constant, wave-like interruption of the code makes parallel execution difficult. When an IA-64 compiler finds a branch in the source code, it analyzes the branch to see whether it is a candidate for predication, and if so it marks all the instructions along each path of the branch with a unique identifier called a predicate. Once the instructions are predicated, the compiler determines which of them the CPU can execute in parallel.
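The predication idea can be sketched in Python (our own analogy, not IA-64 code): both arms of the branch are evaluated unconditionally, and a predicate decides which result is committed.

```python
# Predicated execution, simulated: instead of predicting a branch and
# jumping, compute both paths and let complementary predicates select
# which result is committed to the architectural state.
def predicated_abs(x: int) -> int:
    p_then = x < 0           # predicate guarding the THEN path
    p_else = not p_then      # complementary predicate for the ELSE path

    then_result = -x         # both paths "execute" unconditionally...
    else_result = x

    # ...but only the result guarded by a true predicate is committed.
    return then_result if p_then else else_result

print(predicated_abs(-7))  # 7
print(predicated_abs(3))   # 3
```

On real IA-64 hardware the selection is done by predicate registers attached to each instruction, so no jump ever appears in the instruction stream.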

For example, instructions from the two sides of a branch are independent of each other even though they belong to the same program. The compiler then packs the selected instructions into 128-bit machine words called bundles, each holding three instructions. A template field in the bundle identifies not only which of its instructions can execute independently, but also which instructions in the following bundle are independent. So if the compiler finds 16 instructions with no dependences among them, it can package them into successive bundles of three and mark them with flags in the template. At run time, the CPU scans the templates, picks out the instructions that do not depend on one another, and dispatches them in parallel to the functional units; it then schedules the remaining instructions according to their requirements.

When the CPU finds a predicated branch, it does not try to predict which way the branch will go, and it does not jump over blocks of code to speculatively execute a predicted path. Instead, it starts executing the code for every possible outcome of the branch. There is no branch penalty: the only interruption of the instruction stream occurs when the code is rearranged for parallel execution. At some point, of course, the CPU finally evaluates the comparison behind the IF-THEN. By that time it has usually executed several instructions from both possible paths, but it has not yet stored any of their results. Only now does the CPU commit the results from the correct path and discard the results from the invalid one.

At run time the CPU encounters the speculative load instruction first and tries to fetch the data from memory. Here an IA-64 processor differs from a conventional one. Sometimes the load turns out to be invalid: it may belong to a block of code beyond a branch that has not been taken. A traditional CPU would immediately raise an exception, and if the program cannot handle the exception, the system fails. An IA-64 processor does not report the exception immediately when the load is invalid. Instead, it defers the exception until it encounters the check instruction that validates the speculative load; only then does it raise the exception. By that point the CPU has already resolved the branches that would have led to the fault in the first place. If the path leading to the load is actually taken and the load was invalid, the CPU, having gone ahead speculatively, now reports the exception; but if the load is valid, the exception never happens.
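The deferred-exception behavior can be modeled with a short Python analogy (ours; `NAT` stands in for IA-64's Not-a-Thing deferral token, and `chk` plays the role of the speculation-check instruction):

```python
# Speculative load with a deferred exception, simulated.
NAT = object()  # "Not a Thing": marks a speculative load that faulted

def speculative_load(memory: dict, address: int):
    """ld.s analogue: on a bad address, record the fault instead of raising."""
    return memory.get(address, NAT)

def chk(value):
    """chk.s analogue: raise only when the speculated value is actually used."""
    if value is NAT:
        raise MemoryError("deferred fault: speculative load was invalid")
    return value

memory = {0x1000: 42}

v = speculative_load(memory, 0x1000)    # hoisted above the branch
bad = speculative_load(memory, 0xDEAD)  # would fault on a traditional CPU

print(chk(v))  # 42: the branch using v was taken, so the check passes
# 'bad' is never checked because its branch is not taken,
# so its fault never surfaces.
```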

3.4 Software Pipelining:

Software pipelining takes advantage of programming trends and the large number of available registers, allowing multiple iterations of a loop to be in flight at once. The pipelines are designed for very low latency.
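Schematically (our illustration, not the processor's real scheduler), software pipelining of a loop whose body has three stages works like this: stage s of iteration i runs at step i + s, so n iterations complete in n + stages - 1 steps instead of n × stages.

```python
# Software pipelining, schematically: stage s of iteration i runs at
# step i + s, so successive iterations overlap in the pipeline.
def pipelined_schedule(iterations: int, stages: int):
    schedule = {}  # step -> list of (iteration, stage) running at that step
    for i in range(iterations):
        for s in range(stages):
            schedule.setdefault(i + s, []).append((i, s))
    return schedule

sched = pipelined_schedule(iterations=5, stages=3)
print(len(sched))  # 7 steps total: 5 + 3 - 1
print(sched[2])    # [(0, 2), (1, 1), (2, 0)] -> three iterations in flight
```

The large IA-64 register file (and its rotating registers) is what lets each in-flight iteration keep its own working values without conflicts.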

Short 8-stage in-order main pipeline

In-order issue, out-of-order completion Reduced branch misprediction penalties Fully interlocked, no way-prediction or flush/replay mechanism


Itanium (10 stages)              Itanium 2 (8 stages)
Instruction Pointer Generation   Instruction Pointer Generation
Fetch                            Rotate
Rotate                           Expand
Expand                           Rename
Rename                           Register Read
Word-Line Decode                 Execute
Register Read                    Detect
Execute                          Write Back
Exception Detect
Write Back

4. Execution Core

The Itanium 2 processor's execution logic consists of six multimedia units, six integer units, two floating-point units, three branch units and four load/store units. The processor has general registers and FP registers to manage work in progress. Integer loads are processed by the L1 data cache, while integer stores are processed by the L2; FP loads and stores are also processed by the L2 cache. Whenever a lookup occurs in the L1, a speculative request is sent to the L2 cache. The multimedia engines treat 64-bit data as 2 × 32-bit, 4 × 16-bit or 8 × 8-bit packed data types. Three classes of operations can be performed on these packed, Single Instruction Multiple Data (SIMD) types: arithmetic, shift and data arrangement. The integer engines meanwhile support up to six non-packed integer arithmetic and logical operations. Up to six integer or multimedia operations can be executed each cycle.
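The packed data types can be illustrated with a small SIMD-within-a-register sketch in Python (ours; the masks are the classic SWAR trick, not Itanium code): eight 8-bit lanes are added inside one 64-bit word while carries are kept from crossing lane boundaries.

```python
# SWAR: add eight packed 8-bit values held in one 64-bit integer, with
# carries masked off so they never spill into the neighboring lane.
H = 0x8080808080808080  # high bit of every 8-bit lane
L = 0x7F7F7F7F7F7F7F7F  # low 7 bits of every lane

def packed_add_8x8(a: int, b: int) -> int:
    # Add the low 7 bits of each lane, then fix up the high bit with XOR,
    # so each lane wraps modulo 256 independently of its neighbors.
    return ((a & L) + (b & L)) ^ ((a ^ b) & H)

a = 0x01_02_03_04_05_06_07_FF
b = 0x01_01_01_01_01_01_01_01
print(hex(packed_add_8x8(a, b)))  # 0x203040506070800: each lane +1, last wraps to 0
```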

Itanium              Itanium 2
4 ALUs               6 multimedia units
4 MMX units          6 integer units
2 + 2 FMAC units     2 FP units
2 load/store units   3 branch units
3 branch units       4 load/store units

5. Control

The control section of the Itanium 2 processor is made up of the exception handler and the pipeline control. The exception handler implements exception prioritizing. The pipeline control has a scoreboard to detect register source dependencies and a cache to support data speculation; the machine stalls only when source operands are not yet available. The pipeline control supports predication via the predicate registers. It also contains a Performance Monitoring Unit designed to collect data that can be dumped for analyzing Itanium 2 processor performance.

6. Instruction execution Each 128-bit instruction word (bundle) contains three instructions, and the fetch mechanism can read up to two bundles per clock from the L1 cache into the pipeline. When the compiler can take maximum advantage of this, the processor can execute six instructions per clock cycle. The processor has thirty functional execution units in eleven groups. Each unit can execute a particular subset of the instruction set, and each executes at a rate of one instruction per cycle unless execution stalls waiting for data. While not all units in a group execute identical subsets of the instruction set, common instructions can be executed in multiple units. The execution unit groups are:
- six general-purpose ALUs, two integer units, one shift unit
- four data cache units
- six multimedia units, two parallel shift units, one parallel multiply unit, one population count unit
- two floating-point multiply-accumulate units, two "miscellaneous" floating-point units
- three branch units
The compiler can often group instructions into sets of six that can execute at the same time. Because the floating-point units implement a multiply-accumulate operation, a single floating-point instruction can perform the work of two instructions when the application requires a multiply followed by an add, which is very common in scientific processing. When that occurs, the processor can execute four FLOPs per cycle. For example, the 800 MHz Itanium had a theoretical rating of 3.2 GFLOPS, and the fastest Itanium 2, at 1.67 GHz, was rated at 6.67 GFLOPS.
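The peak-FLOPS figures quoted above follow directly from the unit counts: two FMAC units, each completing one fused multiply-add (2 FLOPs) per cycle. A quick check in Python (ours):

```python
# Peak floating-point throughput: 2 FMAC units x 2 FLOPs per FMA per cycle.
def peak_gflops(clock_ghz: float, fmac_units: int = 2) -> float:
    return clock_ghz * fmac_units * 2

print(peak_gflops(0.8))              # 3.2  -> 800 MHz Itanium
print(round(peak_gflops(1.667), 2))  # 6.67 -> fastest Itanium 2
```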

7. Memory subsystem

All caches are physically indexed, pipelined and non-blocking: scoreboarded registers allow execution to continue until the loaded value is actually used.

7.1 L1 cache

High performance: 32 GB/s, 2 load and 2 store ports
Write-through: all stores are pushed to the L2
FP loads force a miss, FP stores invalidate
True dual-ported read access: no load conflicts
Pseudo-dual-ported store access: 2 store-coalescing buffers per port hold data until the L1D update
Store-to-load forwarding
The one-clock data cache provides a significant performance benefit

7.2 L2 cache

256 kB, 32 GB/s, 5 clk latency
Data array is pseudo-4-ported: 16 banks of 16 kB each
Non-blocking, out-of-order L2 queue (32 entries) holds all in-flight loads/stores; out-of-order service smoothes over load/store/bank conflicts and fills
Can issue/retire 4 stores/loads per clock
Can bypass the L2 queue (5/7/9 clk bypass) if there are no address or bank conflicts in the same issue group and no prior ops in the L2 queue want access to the L2 data arrays
Large L3 on die: 3 MB, 32 GB/s, 12 clk; single ported, full cache line transfers
The large on-die L2 and L3 caches provide significant performance potential
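How addresses might spread across the 16 banks can be sketched as follows (our illustration; the report does not state the actual bank-selection bits, so a 16-byte interleave granularity is assumed here):

```python
# Sketch: 16 banks of 16 kB make up the 256 kB data array. We assume,
# purely for illustration, that banks are interleaved on 16-byte chunks,
# so the bank index comes from the address bits just above the chunk
# offset; two accesses conflict when they hit the same bank.
NUM_BANKS = 16
CHUNK = 16  # assumed interleave granularity in bytes

def bank_of(address: int) -> int:
    return (address // CHUNK) % NUM_BANKS

def conflicts(addr_a: int, addr_b: int) -> bool:
    return bank_of(addr_a) == bank_of(addr_b)

print(bank_of(0x00))           # 0
print(bank_of(0x10))           # 1 -> neighboring chunks go to different banks
print(conflicts(0x00, 0x100))  # True -> same bank, 256 bytes apart
```

This is why the out-of-order L2 queue matters: it reorders accesses to smooth over exactly these bank conflicts.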

7.3 L3 cache

The on-chip L3 cache on the Itanium 2 processor is 1.5 MB or 3 MB in size. It is physically indexed and physically tagged. The L3 is a single-ported, fully pipelined, non-blocking cache, 12-way set-associative with a 128-byte line size. It can support 8 outstanding requests, 7 of which are loads/stores and 1 of which is for fills. The maximum transfer rate from the L3 to the core/L1I/L1D or L2 is 32 GB/s. The L3 protects both tag and data with single-bit-correction, double-bit-detection ECC.
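The geometry of the 3 MB configuration follows arithmetically from the figures above; a quick Python sketch (ours), assuming the usual offset/index/tag split of a physical address:

```python
# L3 geometry: 3 MB, 12-way set-associative, 128-byte lines.
SIZE = 3 * 2 ** 20  # 3 MB
WAYS = 12
LINE = 128          # bytes per line

sets = SIZE // (WAYS * LINE)          # lines per way
offset_bits = LINE.bit_length() - 1   # log2(128) = 7
index_bits = sets.bit_length() - 1    # log2(2048) = 11

def set_index(address: int) -> int:
    """Which of the 2048 sets a physical address maps to."""
    return (address >> offset_bits) % sets

print(sets)                     # 2048
print(offset_bits, index_bits)  # 7 11
```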

7.4 The Advanced Load Address Table (ALAT)

A cache structure called the Advanced Load Address Table (ALAT) is used to enable data speculation in the Itanium 2 processor. The ALAT keeps information on speculative data loads issued by the machine and on any stores that alias with those loads. The structure has 32 entries and is a fully associative array that can handle two loads and two stores per cycle. It provides the aliasing information for the advanced-load "check" operations.
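The ALAT's role can be modeled with a toy Python simulation (ours; the entry format and replacement policy are simplified): an advanced load records its address, an intervening store to the same address invalidates the entry, and the later check reports whether the speculated value is still safe to use.

```python
# Toy ALAT: tracks the addresses of advanced loads; a store that aliases
# an entry knocks it out, so the later check fails and forces a reload.
class ALAT:
    def __init__(self, entries: int = 32):
        self.entries = entries
        self.table = set()  # addresses of in-flight advanced loads

    def advanced_load(self, address: int) -> None:  # ld.a analogue
        if len(self.table) >= self.entries:
            self.table.pop()          # simplistic replacement policy
        self.table.add(address)

    def store(self, address: int) -> None:          # any store
        self.table.discard(address)   # invalidate the aliased entry

    def check(self, address: int) -> bool:          # ld.c / chk.a analogue
        return address in self.table  # True -> speculated value still valid

alat = ALAT()
alat.advanced_load(0x40)   # load hoisted above a possibly aliasing store
alat.store(0x80)           # store to a different address: no alias
print(alat.check(0x40))    # True  -> speculation succeeded
alat.store(0x40)           # aliasing store
print(alat.check(0x40))    # False -> must re-execute the load
```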

7.5 Translation Lookaside Buffers (TLBs)

There are two types of TLBs on the Itanium 2 processor: the Data Translation Lookaside Buffer (DTLB) and the Instruction Translation Lookaside Buffer (ITLB). There are two levels of DTLB: an L1 DTLB and an L2 DTLB. Only L1D cache loads depend on hits in both the L1 and L2 DTLBs; stores and L2/L3 cache hits depend only on L2 DTLB hits. TLB misses in either the DTLB or the ITLB are serviced by the hardware page table walker, which supports the 8-byte and 32-byte Virtual Hash Page Table (VHPT) formats defined by the Itanium instruction set architecture. VHPT data is cached only in the L2 and L3 caches, not in the L1D.

8. Future processors

Tukwila will be released in late 2008. Tukwila's die measures 21.5 × 32.5 mm, and the first run of chips will reach speeds of up to 2 GHz at 170 W of power. Intel will also release a 130 W SKU that the company claims will double the performance of the dual-core Montvale (the 9100-series Itanium) on a mix of TPC-C, SPECint_rate and SPECfp_rate benchmarks. The die micrograph shows Tukwila's four cores, each of which is capable of executing two threads at once, for a total of 8 simultaneous threads per socket. The cores are surrounded by a sea of L2 cache, 30 MB to be exact. With that much cache, Tukwila is sure to shine on branch-intensive database benchmarks. Also lending Tukwila a huge hand on benchmarks will be the QuickPath interconnect technology that Intel will debut with the new chip. QuickPath, formerly called Common Systems Interconnect, is Intel's long-delayed answer to AMD's HyperTransport. Intel is promising peak bandwidths of up to 96 GB/s for processor-to-processor links, and memory bandwidths of up to 34 GB/s from Tukwila's four on-die FB-DIMM channels.
