Exploring the Effects of Hyperthreading on Scientific Applications

by Kent Milfeld [email protected]

Kent Milfeld, Chona Guiang, Avijit Purkayastha, Jay Boisseau

The University of Texas at Austin
Texas Advanced Computing Center (TACC)

TACC Linux Cluster

• 3.7 TFLOPS Cray-Dell machine (to be installed in July)
• 300-node Linux cluster
  – CrayRx (SDSC Rocks + Cray system admin)
  – Dell service
• Dell 1750 Xeon dual-processor nodes
• 3.0 GHz processors, dual-channel 266 MHz DDRAM (1.0 GB per CPU)
• Myrinet CLOS configuration (2 Gb/sec switch, “D” adapters)

Linux Cluster

[Diagram: cluster interconnect with compute nodes (PCs), switches, a file server, and an Internet connection]

Node

Dell PowerEdge 2650 / 1750

2U/1U

Processors: two 3.0 GHz Intel® Xeon processors
Chipset: ServerWorks Grand Champion LE

Memory: 2 GB dual-channel 266 MHz DDR SDRAM
FSB: 533 MHz front side bus
Cache: 512 KB L2 Advanced Transfer Cache
Disk: dual-channel integrated Ultra3 (Ultra160) SCSI, Adaptec® AIC-7899 (160 MB/s) controller

Hyper-Threading Technology

[Diagram: a single-threaded (non-HT) processor holds one architectural state (STATE 0) and runs process/thread/task 1 OR 2 at a time; a Hyper-Threaded (HT) processor holds two architectural states (STATE 0 and STATE 1) and runs processes/threads/tasks 1 AND 2 concurrently]

Hyper-Threading Technology

• Hyper-Threading is an implementation of an architectural technique called Simultaneous Multithreading (SMT)
• Exploits instruction-level parallelism on a single CPU
• Why? Comm. server workload efficiency is about 67%
• Performance benefits from “independent” processes/threads (see the sketch after this list)
  – Any time a thread stalls waiting for resources
  – Any time disparate operations are used
  – Commercial: software multithreading (algorithm & code modification)
  – HPC: already has a large number of parallel applications
• What about the SHARING?
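As a minimal illustration of such independent work (our own sketch, not code from the talk), the two OpenMP sections below give one thread integer-heavy work and the other floating-point-heavy work; on an HT-enabled Xeon the two logical processors can overlap these disparate operations on a single physical CPU. The program name and problem size are arbitrary.

! Minimal sketch (not from the talk): two independent OpenMP sections,
! one integer-heavy and one floating-point-heavy, the kind of disparate
! work streams Hyper-Threading can overlap on one physical processor.
program disparate_ops
  use omp_lib
  implicit none
  integer, parameter :: n = 50000000
  integer(8) :: isum
  integer    :: i
  real(8)    :: fsum, t0, t1

  isum = 0_8
  fsum = 0.0d0
  t0 = omp_get_wtime()
!$omp parallel sections num_threads(2)
!$omp section
  do i = 1, n                       ! integer-unit work
     isum = isum + mod(i, 7)
  end do
!$omp section
  do i = 1, n                       ! floating-point-unit work
     fsum = fsum + 1.0d0/real(i, 8)
  end do
!$omp end parallel sections
  t1 = omp_get_wtime()

  print *, 'isum =', isum, ' fsum =', fsum, ' wall time =', t1 - t0, 's'
end program disparate_ops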

Architecture
• Intel Xeon
  – Seven-way superscalar
  – Able to pipeline 128 instructions
• Intel solution:
  – Efficiency instead of redundancy
  – With only 5% more real estate (die area)
• HT: 2 logical processors per CPU
  – Share most of the processor resources
  – Two processes and/or tasks execute concurrently in the logical units

Microprocessor Architecture

[Diagram: processor block diagram showing the system bus, bus/memory interface, optional L3 cache (server products), 8-way L2 cache, 4-way L1 cache, front end (fetch/decode, trace cache, BTB/branch prediction with branch history update), out-of-order execution core, and retirement; frequent and less-frequent paths are distinguished]

Pipeline Depth

Pentium III processor misprediction pipeline (10 stages):
Fetch, Fetch, Decode, Decode, Decode, Rename, ROB Rd, Rdy/Sch, Dispatch, Exec

Pentium 4 processor misprediction pipeline (20 stages):
TC Nxt IP, TC Nxt IP, TC Fetch, TC Fetch, Drive, Alloc, Rename, Rename, Que, Sch, Sch, Sch, Disp, Disp, RF, RF, Ex, Flgs, Br Ck, Drive

Plus physical (renaming) registers: 128 integer and 128 floating point

Redesigned Pipeline for HT

[Diagram: pipeline redesigned for HT, with partitioned queues between stages: front end (IP, ITLB, decode, L2 access queue, decode queue, trace cache, trace cache queue) feeding the out-of-order execution engine (allocate, register rename via the register alias table, 8-12 deep queues, schedulers, register read, execute, L1 D-cache, register write, re-order buffer, store buffer, retire)]

Memory Measurements & Application Scaling with HT
• Memory characterization for a single processor
  – Latency
  – Bandwidth
• Memory characterization for HT
  – Worst use of memory (non-strided gather/scatter)
  – Best use of memory (sequential access)
• Applications (MxM, MD, LP)

Memory Latency

• Measuring Latency

I1 = IA(1)
DO I = 2,N
   I2 = IA(I1)
   I1 = I2
END DO
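A self-contained version of this pointer chase is sketched below (our reconstruction, not the original TACC benchmark): IA is filled with a random single-cycle permutation so that every load depends on the one before it, and the chase is timed with omp_get_wtime. Varying the array size and multiplying the measured nanoseconds by the 3.0 GHz clock rate reproduces the kind of latency-versus-size curves, in clock periods, shown on the following slides.

! Sketch of a complete latency test (our reconstruction, not the original
! TACC code). IA holds a random single-cycle permutation, so every load
! depends on the previous one and the loop time is pure load latency.
program mem_latency
  use omp_lib
  implicit none
  integer, parameter :: n = 2*1024*1024        ! 8 MB of 4-byte indices
  integer, allocatable :: ia(:), idx(:)
  integer :: i, j, i1, i2, tmp
  real(8) :: t0, t1, r

  allocate(ia(n), idx(n))
  do i = 1, n                                  ! identity permutation
     idx(i) = i
  end do
  do i = n, 2, -1                              ! Fisher-Yates shuffle
     call random_number(r)
     j = 1 + int(r*i)
     tmp = idx(i);  idx(i) = idx(j);  idx(j) = tmp
  end do
  do i = 1, n-1                                ! chain into one big cycle
     ia(idx(i)) = idx(i+1)
  end do
  ia(idx(n)) = idx(1)

  i1 = ia(1)
  t0 = omp_get_wtime()
  do i = 2, n
     i2 = ia(i1)
     i1 = i2
  end do
  t1 = omp_get_wtime()
  ! i1 is printed so the compiler cannot discard the chase
  print *, 'avg load latency:', (t1 - t0)/real(n-1, 8)*1.0d9, 'ns  (', i1, ')'
end program mem_latency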

Memory Latency Measurement

[Plot: single and dual processor latency, in clock periods, vs. array size (KB, up to 20,000), with an inset for small sizes (0.1 to 1000 KB); curves for Single CPU, Dual CPU0, and Dual CPU1]

Bandwidth (with 2 Streams)

• Measuring Bandwidth

DO I = 1,N
   S = S + A(I)
   T = T + B(I)
END DO
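A runnable version of the two-stream read test is sketched below (again our reconstruction, with array size and initialization chosen arbitrarily); it reports bandwidth in MB/sec from the bytes actually read. Running one copy per processor gives per-CPU curves like the ones that follow.

! Sketch of a complete two-stream read-bandwidth test (our
! reconstruction, not the original TACC code).
program mem_bandwidth
  use omp_lib
  implicit none
  integer, parameter :: n = 8*1024*1024        ! 64 MB per stream
  real(8), allocatable :: a(:), b(:)
  real(8) :: s, t, t0, t1, mbytes
  integer :: i

  allocate(a(n), b(n))
  a = 1.0d0
  b = 2.0d0
  s = 0.0d0
  t = 0.0d0

  t0 = omp_get_wtime()
  do i = 1, n
     s = s + a(i)                              ! stream 1
     t = t + b(i)                              ! stream 2
  end do
  t1 = omp_get_wtime()

  mbytes = 2.0d0*8.0d0*real(n, 8)/1.0d6        ! two 8-byte streams read once
  print *, 'read bandwidth:', mbytes/(t1 - t0), 'MB/s  (', s + t, ')'
end program mem_bandwidth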

Memory Bandwidth

[Plot: Memory Read, bandwidth (MB/sec) vs. size (KB, up to 2000) for a single processor; total sustained bandwidth about 2.3 GB/sec]

HT Memory Latency Measurement

[Plots: single/dual processor latency (Single CPU, Dual CPU0, Dual CPU1) alongside Hyper-Threading latency with 3 tasks on 2 processors (LP0, LP1, LP2); latency in clock periods vs. size (KB, up to 20,000), with insets for small sizes]

HT Memory Latency Measurement

[Plot: Hyper-Threading latency with 4 tasks on 2 processors, latency in clock periods vs. size (KB, up to 20,000) for logical processors LP0-LP3, with an inset for small sizes]

Intro

[Diagram: two dual-Xeon node layouts. Dell Xeons (July 2003): 4.3 GB/sec front-side bus to the north bridge, two 2.1 GB/sec memory channels. Experimental cluster (the system used in the following analyses): 3.2 GB/sec front-side bus to the north bridge, two 1.6 GB/sec memory channels.]

HT Memory Bandwidth Measurement

[Plots: Memory Read (distributed memory model), bandwidth (MB/sec) vs. size (KB, up to 2000); one panel for two CPUs (CPU0, CPU1) and one for three logical processors (LP0, LP1, LP2), with total sustained bandwidths of about 2.20 and 2.05 GB/sec]
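One way to set up these "distributed memory model" runs is sketched below (our construction, not the original code): each OpenMP thread streams through its own private pair of arrays, so the logical processors share only bus and cache capacity, never data. Setting OMP_NUM_THREADS to 2, 3, or 4, with threads bound to logical CPUs through the compiler's affinity controls, corresponds to the two-CPU, three-LP, and four-LP curves on this and the next slide.

! Sketch of the "distributed memory model" read test under HT (our
! construction): every thread sums its own private arrays concurrently.
program ht_bandwidth
  use omp_lib
  implicit none
  integer, parameter :: n = 4*1024*1024        ! 32 MB per stream per thread
  real(8), allocatable :: a(:), b(:)
  real(8) :: s, t, t0, t1
  integer :: i, tid

!$omp parallel private(a, b, s, t, t0, t1, i, tid)
  tid = omp_get_thread_num()
  allocate(a(n), b(n))
  a = 1.0d0
  b = 2.0d0
  s = 0.0d0
  t = 0.0d0
  ! start all streams together before timing
!$omp barrier
  t0 = omp_get_wtime()
  do i = 1, n
     s = s + a(i)
     t = t + b(i)
  end do
  t1 = omp_get_wtime()
  print *, 'thread', tid, ':', 16.0d0*real(n, 8)/1.0d6/(t1 - t0), 'MB/s  (', s + t, ')'
  deallocate(a, b)
!$omp end parallel
end program ht_bandwidth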

HT Memory Bandwidth Measurement

[Plot: Memory Read (distributed memory model), bandwidth (MB/sec) vs. size (KB, up to 2000) for four logical processors; total sustained bandwidth about 1.73 GB/sec]

HT Memory Bandwidth Measurement

[Plot: Memory Read, cache-friendly access model, bandwidth (MB/sec) vs. size (KB, up to 2000) for four logical processors LP0-LP3; roughly 30 GB/sec total while data fits in cache, dropping to about 3.6 GB/sec total from memory]

Parallel Matrix-Matrix Multiply

[Plot: matrix-matrix multiply performance vs. matrix order (0-1000) for 1, 2, 3, and 4 threads/tasks; performance axis 0-6000]
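The slides do not say which matrix-multiply code or library was timed; the OpenMP program below is a generic stand-in that shows the threading pattern. Run with OMP_NUM_THREADS=1 or 2 it uses at most one logical processor per physical CPU; 3 or 4 threads exercise Hyper-Threading.

! Generic OpenMP matrix-matrix multiply (our stand-in; the code actually
! timed on the slides is not shown). Column-major-friendly j-k-i loop
! order, with the outer loop split across the threads.
program mxm
  use omp_lib
  implicit none
  integer, parameter :: n = 1000
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  real(8) :: t0, t1
  integer :: i, j, k

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0d0

  t0 = omp_get_wtime()
!$omp parallel do private(i, k) schedule(static)
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k)*b(k,j)
        end do
     end do
  end do
!$omp end parallel do
  t1 = omp_get_wtime()

  print *, 'MFLOPS:', 2.0d0*real(n, 8)**3/(t1 - t0)/1.0d6, '  check:', c(1,1)
end program mxm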

Molecular Dynamics

Molecular dynamics simulation of a 256-particle argon lattice: a one-picosecond simulation using the Verlet algorithm.

Threading      1 Thread   2 Threads   3 Threads (HT)   4 Threads (HT)
Time (sec)     7.95       4.06        3.89             3.63
Scaling        1          1.96        2.04             2.19
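The slides name the Verlet algorithm but do not show the MD code. The sketch below is a generic velocity Verlet integrator with an O(N^2) Lennard-Jones force in reduced units on an arbitrarily chosen 256-atom lattice; all of these details are our assumptions rather than the actual TACC code, and the threading (e.g., an OpenMP parallel loop over the force calculation) is omitted to keep the kernel short.

! Generic velocity Verlet MD sketch (our stand-in, not the TACC code):
! 256 atoms, reduced Lennard-Jones units, unit mass, no periodic box.
program md_sketch
  implicit none
  integer, parameter :: natom = 256, nstep = 200
  real(8), parameter :: dt = 0.005d0
  real(8) :: x(3,natom), v(3,natom), f(3,natom)
  integer :: i, j, k, p, step

  p = 0                                        ! 8 x 8 x 4 lattice, spacing 1.2
  do i = 0, 7
     do j = 0, 7
        do k = 0, 3
           p = p + 1
           x(1,p) = 1.2d0*i
           x(2,p) = 1.2d0*j
           x(3,p) = 1.2d0*k
        end do
     end do
  end do
  v = 0.0d0
  call lj_forces(x, f)

  do step = 1, nstep                           ! velocity Verlet integration
     v = v + 0.5d0*dt*f                        ! half-kick
     x = x + dt*v                              ! drift
     call lj_forces(x, f)                      ! forces at new positions
     v = v + 0.5d0*dt*f                        ! second half-kick
  end do
  print *, 'kinetic energy:', 0.5d0*sum(v*v)

contains

  subroutine lj_forces(x, f)                   ! O(N^2) Lennard-Jones forces
    real(8), intent(in)  :: x(3,natom)
    real(8), intent(out) :: f(3,natom)
    integer :: ii, jj
    real(8) :: r(3), r2, ir6, fmag
    f = 0.0d0
    do ii = 1, natom - 1
       do jj = ii + 1, natom
          r    = x(:,ii) - x(:,jj)
          r2   = sum(r*r)
          ir6  = 1.0d0/r2**3
          fmag = 24.0d0*(2.0d0*ir6*ir6 - ir6)/r2
          f(:,ii) = f(:,ii) + fmag*r
          f(:,jj) = f(:,jj) - fmag*r
       end do
    end do
  end subroutine lj_forces

end program md_sketch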

Monte Carlo Lithography Simulation

10**7 Monte Carlo iterations. Each thread outputs the lattice configuration and various Monte Carlo statistics to disk.

Threading      1 Thread   2 Threads   3 Threads (HT)   4 Threads (HT)
Time (sec)     19.9       15.7        13.1             11.5
Scaling        1          1.27        1.52             1.73
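The lithography model itself is not shown on the slides; the sketch below only illustrates the threading pattern described above: each OpenMP thread runs its own Monte Carlo loop and periodically writes statistics to its own file, so one thread's disk I/O can overlap another thread's computation on the same physical processor. The "move" and acceptance test are placeholders.

! Sketch of per-thread Monte Carlo with per-thread disk output (our
! illustration; the actual lithography model is not shown). The random
! "move" and acceptance test are placeholders.
program mc_sketch
  use omp_lib
  implicit none
  integer, parameter :: niter = 10000000       ! 10**7 iterations per thread
  integer :: tid, i, unit, naccept
  real(8) :: r(2), esum
  character(len=32) :: fname

!$omp parallel private(tid, i, unit, naccept, esum, r, fname)
  tid = omp_get_thread_num()
  write(fname, '(a,i0,a)') 'mc_stats_', tid, '.txt'
  open(newunit=unit, file=fname, status='replace')

  naccept = 0
  esum = 0.0d0
  do i = 1, niter
     call random_number(r)                     ! placeholder MC move
     if (r(2) < 0.5d0) naccept = naccept + 1   ! placeholder acceptance test
     esum = esum + r(1)
     if (mod(i, 1000000) == 0) write(unit, *) i, naccept, esum/real(i, 8)
  end do

  close(unit)
!$omp end parallel
end program mc_sketch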

Conclusions: HT Performance
• Multiple “independent” processes/tasks/threads are necessary.
• Bandwidth-limited and/or compute-intensive codes may see degraded performance.
• Codes that are not at these extremes may see performance enhancements from:
  – Cache sharing (synchronization may be required)
  – Non-streaming memory access
  – I/O
  – Disparate operations (e.g., integer multiply & floating-point multiply)
