Multithreading allows multiple threads to share the functional units of a single processor. By contrast, a process switch on a multiprogrammed system takes 100s - 1000s of cycles: it saves the PC, registers, and other process state such as the PTBR, and perhaps flushes the cache and TLB. We need a thread switch to be fast and efficient. Threads share the same VM, so a thread switch only needs to save the PC and registers.

In multithreading, we provide each thread its own PC and registers. Coarse-grained MT: switch threads only on expensive stalls such as L2 cache misses. Because it hides only expensive stalls, the thread switch doesn't have to be as efficient. The old thread empties the pipe and the new thread fills it up, which simplifies hazards.


Fine-grained MT: switch threads round-robin among the available threads on every instruction. Needs more complex hardware. Can hide losses from shorter stalls, but slows down individual threads. No cycle is left completely empty, yet not all issue slots can be filled.


Simultaneous MT (SMT): needs dynamic multiple issue. Multiple threads can use the issue slots in a single clock cycle. Uses register renaming and dynamic scheduling to associate instruction slots and renamed registers with the proper threads. Improves utilization, but needs more complex hardware. Finds more ILP among multiple threads.

FIGURE 6.5 How four threads use the issue slots of a superscalar processor in different approaches. The four threads at the top show how each would execute running alone on a standard superscalar processor without multithreading support. The three examples at the bottom show how they would execute running together in three multithreading options. The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of gray and color correspond to four different threads in the multithreading processors. The additional pipeline start-up effects for coarse multithreading, which are not illustrated in this figure, would lead to further loss in throughput for coarse multithreading. Copyright © 2009 Elsevier, Inc. All rights reserved.


FIGURE 6.6 The speed-up from using multithreading on one core on an i7 processor averages 1.31 for the PARSEC benchmarks (see Section 6.9) and the energy efficiency improvement is 1.07. This data was collected and analyzed by Esmaeilzadeh et al. [2011]. Copyright © 2014 Elsevier, Inc. All rights reserved.

• Multi-threading was not introduced to deal with power issues; it was more concerned with keeping all the functional units busy, which actually requires more power. The Intel Nehalem multi-core processor uses two-thread-per-core SMT, but the power wall is pushing designs toward simpler, more power-efficient cores.

• The goal of multi-threading is not so much improving performance as improving utilization by sharing resources. The same sharing also happens in multi-core processors, which share L3 caches, floating-point units, etc.

• Fine-grained multi-threading can easily hide cache misses, so it helps with memory latency issues.

Diagram: several processors, each with its own cache, connected through an interconnection network to shared memory and I/O.

FIGURE 6.7 Classic organization of a multiprocessor

• Single (physical) address space for all processors
• Programs are available to run on any processor
• Each processor can still run independent jobs in its own address space
• Processors communicate with each other through shared memory, synchronized with semaphores (a minimal sketch follows below).
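The last bullet above can be made concrete with a minimal sketch, assuming POSIX threads and an unnamed POSIX semaphore; the names shared_sum and worker are illustrative, not from the text. Two threads share one address space, and the semaphore (initialized to 1, so it behaves as a lock) serializes their updates to the shared variable.

/* Minimal sketch: communication through shared memory, synchronized with a semaphore. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static long shared_sum = 0;      /* lives in the single shared address space */
static sem_t mutex;              /* binary semaphore guarding shared_sum */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sem_wait(&mutex);        /* enter critical section */
        shared_sum += 1;
        sem_post(&mutex);        /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    sem_init(&mutex, 0, 1);                       /* initial value 1 => acts as a lock */
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_sum = %ld\n", shared_sum);     /* expect 200000 */
    return 0;
}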

UMA (uniform memory access): the same time to access memory, independent of which processor does the access and which address is being accessed

NUMA (non-uniform memory access):
• accesses to memory nearer the processor are faster
• more programming challenges for performance
• handles larger numbers of processors (> 64)

Example: Sum 100,000 numbers using a 10-processor SMP

P0: A[0] + ... + A[9,999] = Sum[P0]
P1: A[10,000] + ... + A[19,999] = Sum[P1]
P2: A[20,000] + ... + A[29,999] = Sum[P2]
P3: A[30,000] + ... + A[39,999] = Sum[P3]
P4: A[40,000] + ... + A[49,999] = Sum[P4]
P5: A[50,000] + ... + A[59,999] = Sum[P5]
P6: A[60,000] + ... + A[69,999] = Sum[P6]
P7: A[70,000] + ... + A[79,999] = Sum[P7]
P8: A[80,000] + ... + A[89,999] = Sum[P8]
P9: A[90,000] + ... + A[99,999] = Sum[P9]

Reduction: the sequential sum takes 100,000t. With 10 processors, each partial sum takes 10,000t and the reduction of the 10 partial sums takes about 4t more, for a total of about 10,004t, i.e. roughly a 10 times speedup.

SPMD Text Example (pages 521-2): 64 processors summing 64,000 numbers

sum[Pn] = 0;
for (i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
    sum[Pn] += A[i];    /* sum the assigned areas */

half = 64;    /* 64 processors in multiprocessor */
do {
    synch();    /* barrier synchronization: wait for partial sum completion */
    if ((half % 2 != 0) && Pn == 0)
        sum[0] = sum[0] + sum[half - 1];    /* conditional sum needed when half is odd */
    half = half / 2;    /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] += sum[Pn + half];
} while (half > 1);    /* exit with final sum in sum[0] */
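The code above is SPMD pseudocode: the processor number Pn and the barrier synch() are assumed to be provided by the runtime. Below is a minimal runnable sketch of the same partial-sum-plus-reduction using POSIX threads, with pthread_barrier_wait() standing in for synch(); the thread count (8) and problem size (8,000) are shrunk only to keep the sketch small, and all names are illustrative.

/* Sketch: SPMD-style partial sums followed by a tree reduction, on one
 * shared-memory node with POSIX threads playing the role of processors. */
#include <pthread.h>
#include <stdio.h>

#define NUM_P 8                        /* "processors" (threads) */
#define N     8000                     /* N / NUM_P elements per thread */

static double A[N];
static double sum[NUM_P];
static pthread_barrier_t barrier;

static void *spmd(void *arg)
{
    long Pn = (long)arg;
    long chunk = N / NUM_P;

    sum[Pn] = 0;
    for (long i = chunk * Pn; i < chunk * (Pn + 1); i++)
        sum[Pn] += A[i];               /* sum the assigned area */

    long half = NUM_P;
    do {
        pthread_barrier_wait(&barrier);         /* synch(): wait for the previous step */
        if ((half % 2 != 0) && Pn == 0)
            sum[0] += sum[half - 1];            /* conditional sum when half is odd */
        half = half / 2;
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half > 1);                         /* final sum ends up in sum[0] */
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++) A[i] = 1.0;    /* so the expected answer is N */
    pthread_barrier_init(&barrier, NULL, NUM_P);

    pthread_t t[NUM_P];
    for (long p = 0; p < NUM_P; p++)
        pthread_create(&t[p], NULL, spmd, (void *)p);
    for (long p = 0; p < NUM_P; p++)
        pthread_join(t[p], NULL);

    printf("sum[0] = %.0f (expected %d)\n", sum[0], N);
    return 0;
}

Each pthread_barrier_wait() ensures that the partial sums written in one step are visible to every thread before the next step reads them.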

False sharing: the per-processor partial sums sum[0], sum[1], sum[2], ... are stored next to each other in memory, so several of them fall in the same cache block. Processors updating different elements of sum therefore keep invalidating each other's cached copies of that block, even though no element is actually shared. One common remedy is sketched below.
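A minimal sketch of the padding remedy, assuming 64-byte cache blocks; the struct name padded_sum is made up for illustration, not from the text.

/* Sketch: avoid false sharing by giving each per-processor accumulator its own
 * cache block. The 64-byte block size is an assumption. */
#define CACHE_BLOCK 64

struct padded_sum {
    double value;                               /* the real partial sum */
    char pad[CACHE_BLOCK - sizeof(double)];     /* filler so neighbouring accumulators land in different blocks */
};

static struct padded_sum sum[10];               /* use sum[Pn].value instead of sum[Pn] */

An alternative is for each processor to accumulate into a private local variable and store it into the shared array only once at the end.

As CPUs have evolved, so also have GPUs.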

GPUs versus CPUs

• GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the tasks of a CPU. This role allows them to dedicate all their resources to graphics. It's fine for GPUs to perform some tasks poorly or not at all, given that in a system with both a CPU and a GPU, the CPU can do them if needed. A CPU-GPU combination is one example of heterogeneous computing, where not all the processors are identical.

• GPU (graphics) problem sizes are typically hundreds of megabytes to gigabytes (not hundreds of gigabytes to terabytes) and the data do not show the same temporal locality as data does in mainstream applications. Moreover, there is a great deal of data-level parallelism in these tasks. Graphics processing involves drawing vertices of 3D geometry primitives such as lines and triangles and shading or rendering pixel fragments of geometric primitives. Each vertex can be drawn independently, and each pixel fragment can be rendered independently. To render millions of pixels per frame rapidly, the GPU evolved to execute many threads from vertex and pixel shader programs in parallel.

• The graphics data types are vertices, consisting of (x, y, z, w) coordinates, and pixels, consisting of (red, green, blue, alpha) color components. GPUs represent each vertex component as a 32-bit floating-point number. Each of the four pixel components was originally an 8-bit unsigned integer, but recent GPUs now represent each component as a single-precision floating-point number between 0.0 and 1.0. These differences led to different styles of architecture:

• Perhaps the biggest difference is that GPUs do not rely on multilevel caches to overcome the long latency to memory, as do CPUs. Instead, GPUs rely on hardware multithreading to hide the latency to memory. That is, between the time of a memory request and the time that data arrives, the GPU executes hundreds or thousands of threads that are independent of that request.

• GPU memory is thus oriented toward bandwidth rather than latency. There are even special graphics DRAM chips for GPUs that are wider and have higher bandwidth than DRAM chips for CPUs. In addition, GPUs have traditionally had smaller main memories than conventional microprocessors. In 2013, GPUs typically had 4 to 6 GiB or less, while CPUs had 32 to 256 GiB. Finally, keep in mind that for general-purpose computation using a GPU, you must include the time to transfer the data between CPU memory and GPU memory, since the GPU is a coprocessor.

• Given the reliance on many threads to deliver good memory bandwidth, GPUs can accommodate many parallel processors (MIMD) as well as many SIMD (~vector) threads. Hence, each GPU processor is more highly multithreaded than a typical CPU, in addition to the GPU having more processors. Programmers have wanted to see if GPUs might be used to support high-performance computing outside the realm of graphics processing.

• Programmers originally tried to structure general-purpose code to use graphics API calls and shading languages. More recently, they have developed programming languages that let them write programs directly for GPUs, such as NVIDIA's CUDA (Compute Unified Device Architecture), which allows the use of a somewhat restricted C.

• NVIDIA decided that the unifying theme of all these forms of parallelism is the CUDA thread. Using this as the programming primitive, the compiler and hardware can gang together thousands of CUDA threads in order to exploit the various styles of parallelism available in the GPU: SIMD, MIMD, and ILP.

• Threads are blocked together and executed in groups of 32 at a time. These thread blocks are assigned to one of the 8-32 multithreaded processors inside the GPU (a toy kernel sketch follows below).
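A toy CUDA C sketch of these ideas is shown below; it is not from the text, and the names (vec_add, h_a, d_a, etc.) are made up. Each array element is handled by one CUDA thread, threads are grouped into blocks whose size is a multiple of the 32-thread group, and the explicit cudaMemcpy calls illustrate the CPU-to-GPU data transfers mentioned earlier.

/* Sketch: one CUDA thread per element; blocks of 256 threads are distributed
 * over the GPU's multithreaded SIMD processors by the hardware. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* this thread's element */
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* host (CPU) data */
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* device (GPU) memory and the CPU <-> GPU transfers */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads_per_block = 256;                       /* a multiple of the 32-thread warp */
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    vec_add<<<blocks, threads_per_block>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   /* also waits for the kernel */
    printf("h_c[0] = %f\n", h_c[0]);                        /* expect 3.0 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}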

• Like vector architectures, GPUs work well only with data-level parallel problems. Both have gather-scatter addressing. GPUs have more registers than vector machines, and they also use fine-grained hardware multi-threading to hide memory latency. An NVIDIA Fermi GPU can have 7, 11, 14, or 15 of these multithreaded SIMD processors. Instead of pipelined ALUs (as in vector machines), GPU thread processors use many parallel ALUs. A 32-element vector (32 bits per element) executed over 16 lanes takes 2 clocks per SIMD instruction and uses 2 registers per vector in each lane. With 64 vector registers per thread, each lane's 2048 registers ÷ (2 registers per vector × 64 vectors per thread) = 16 threads per SIMD processor.

Figure 6.9 diagram: an instruction register drives 16 SIMD lanes (thread processors); each lane has its own bank of 32-bit registers and a load/store unit; an address coalescing unit and an interconnection network connect the lanes to global memory and to 64 KiB of local memory.

FIGURE 6.9 Simplified block diagram of a multithreaded SIMD Processor. It has 16 SIMD lanes. The SIMD Thread Scheduler has many independent SIMD threads that it chooses from to run on this processor. Copyright © 2014 Elsevier, Inc. All rights reserved.