FPL-2019 Why-I-Love-Latency


why I love memory latency
high throughput through latency masking hardware multithreading
WALID NAJJAR
UC RIVERSIDE

the team
the cheerleaders: Walid A. Najjar, Vassilis J. Tsotras, Vagelis Papalexakis, Daniel Wong
the players:
• Robert J. Halstead (Xilinx)
• Ildar Absalyamov (Google)
• Prerna Budhkar (Intel)
• Skyler Windh (Micron)
• Vasileios Zois (UCR)
• Bashar Roumanous (UCR)
• Mohammadreza Rezvani (UCR)
Yale is 80 - © W. Najjar

outline
✦ background
✦ filament execution model
✦ application - SpMV
✦ application - Selection
✦ application - Group by aggregation
✦ conclusion
FPL 2019 - © W. Najjar

current DRAM trends
Growing memory capacity and low cost have created a niche for in-memory solutions.
Low-latency memory hardware is more expensive (39x) and has lower capacity than commonly used DDRx DRAM.
Though memory performance is improving, it is not likely to match CPU performance anytime soon.
Need alternative techniques to mitigate memory latency!
[Figure: peak CPU performance vs. peak memory performance; y-axis: Clock (MHz), 0 to 3000; x-axis: years 1990 to 2016]
K. K. Chang. Understanding and Improving Latency of DRAM-Based Memory Systems. PhD thesis, Carnegie Mellon University, 2017.
latency mitigation - aka caching
Intel Core-i7-5960x
  Launch Date        Q3’14
  Clock Freq.        3.0 GHz
  Cores / Threads    8 / 16
  L3 Cache           20 MB
  Memory Channels    4
  Memory Bandwidth   68 GB/s
๏ caches > 80% of CPU area
๏ big energy sink!
๏ application must have temporal and/or spatial locality
๏ up to 7 levels storage hierarchy (register file, L1, L2, L3, memory, disk cache, disk)
caches everywhere! reduces average memory latency, assumes locality

where cycles go - cache misses
[DaMoN’13] Tözün, P. et al. OLTP in wonderland: where do cache misses come from in major OLTP components?
[TODS’07] Chen, S. et al. Improving Hash Join Performance through Prefetching

big data analytics
BIG data (x petaB) + hyper dimensional
no data marshaling
sparse data
no locality!

You have to know the past to understand the present - Carl Sagan
the latency problem & solutions
✦ latency mitigation, aka “caching”
   ❖ effective only with spatial/temporal localities
✦ streaming: latency incorporated into pipeline
   ❖ requires spatial locality
✦ latency masking, aka hardware multithreading
   ❖ fast context switch after memory access
   ❖ small thread context
   ❖ concurrency == outstanding memory requests

Burton Smith (1941 - 2018)
❖ main architect of
   ❖ Denelcor HEP (1982)
   ❖ Horizon (1985-88) @ IDA SRC
   ❖ Tera MTA (1999) later Cray XMT
❖ latency masking
   ❖ = do useful work while waiting for memory = multithreading
   ❖ switch to a ready process on long-latency I/O operations
   ❖ => higher throughput
   ❖ 50 years ago it was I/O latency => multi-programming
❖ https://www.microsoft.com/en-us/research/people/burtons/
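The bullet "concurrency == outstanding memory requests" can be made concrete with Little's law (requests in flight = request rate × latency). The sketch below is only an illustration: the 68 GB/s figure is the peak bandwidth quoted on the caching slide, while the 100 ns latency and the 64-byte request size are assumptions chosen for the example, not numbers from the talk.

```python
def required_outstanding_requests(bandwidth_bytes_per_s, latency_s, request_bytes):
    """Little's law: requests in flight = request rate x latency."""
    request_rate = bandwidth_bytes_per_s / request_bytes  # requests per second
    return request_rate * latency_s


# 68 GB/s is quoted in the deck; 100 ns and 64 bytes are assumed values.
in_flight = required_outstanding_requests(68e9, 100e-9, 64)
print(round(in_flight))  # about 106 requests must be outstanding
```

Under these assumed numbers, roughly a hundred independent memory requests have to be in flight at all times to keep the memory channels busy, which is the concurrency that latency masking is meant to supply.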
Tera MTA/Cray XMT
✦ supercomputer, barrel processor
✦ 128 independent threads/processor
✦ 2 inst/cycle issued: 1 ALU, 1 memory
✦ 1 full/empty bit per word ensures correct ordering of memory accesses
✦ longest memory latency: 128 cycles
✦ randomized memory: reduces bank conflicts
✦ processor & memory nodes on a 3D torus interconnection network
✦ 4096 nodes: sparsely populated
[Figure: per-thread registers feeding a shared datapath]

barrel processor
✦ CPU that switches between threads of execution on every cycle.
✦ also known as "interleaved" or "fine-grained" temporal multithreading.
✦ each thread is assigned its own program counter and other hardware registers (architectural state).
✦ guarantees each thread will execute one instruction every n cycles.
✦ an n-way barrel processor acts like n separate processors,
✦ each running at ~ 1/n the original speed.
[Figure: thread execution order, per-thread state, MEMORY]

what if we take it a step further?
✦ trade bandwidth for latency
✦ custom or semi-custom processor/accelerator
✦ no need for large register files and thread state
   ➡ smaller data path
   ➡ more processors
   ➡ more threads per processor (engine or core)
   ➡ much higher throughput!
✦ suited for large-scale irregular applications:
   ➡ data analytics, databases, sparse linear algebra: SpMV, SpMM, graph algorithms, bioinformatics
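The barrel processor bullets (switch threads every cycle, one instruction per thread every n cycles) can be illustrated with a tiny round-robin schedule. This is a minimal sketch, not a model of the Tera MTA; the function names and the choice of 4 threads over 12 cycles are hypothetical.

```python
def barrel_schedule(n_threads, n_cycles):
    """Return the id of the thread that issues an instruction on each cycle."""
    return [cycle % n_threads for cycle in range(n_cycles)]


def instructions_per_thread(schedule, n_threads):
    """Count how many instructions each thread issued in the schedule."""
    counts = [0] * n_threads
    for thread_id in schedule:
        counts[thread_id] += 1
    return counts


schedule = barrel_schedule(n_threads=4, n_cycles=12)
print(schedule)                              # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
print(instructions_per_thread(schedule, 4))  # [3, 3, 3, 3]
```

Each of the 4 threads issues 3 instructions in 12 cycles, i.e. each runs at 1/4 of the issue rate, matching the "n separate processors at ~1/n speed" description.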
hardware multithreading
✦ call them filaments, to distinguish them from “threads”
✦ an executing filament is preempted once it accesses memory
✦ ready & waiting filament queues fit 1000s of thread states
✦ massive parallelism: ≥ 4,000 waiting filaments
[Figure: core with a custom data path; a ready filament enters the data path and issues a request to MEMORY; it becomes a waiting filament; on the response the thread is moved to the ready queue]

filament data path
[Figure: DATA PATH with input data and output data, read and write ports to MEMORY, a wait queue, and filament termination]
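The ready/waiting queue mechanism described in the hardware multithreading slide can be sketched as a small cycle-level simulation. Everything here is an assumption made for illustration: the function name, a single data path that handles one ready filament per cycle, a fixed memory latency, and one memory access per filament. It is not the hardware design from the talk.

```python
from collections import deque


def simulate_filaments(num_filaments, max_in_flight, latency, accesses_per_filament=1):
    """Return the number of cycles one data path needs to finish all filaments."""
    ready = deque()      # remaining-access counts of ready filaments
    waiting = []         # (wake_cycle, remaining_accesses) of waiting filaments
    not_started = num_filaments
    done = 0
    cycle = 0
    while done < num_filaments:
        # Memory responses: filaments whose data has arrived become ready.
        still_waiting = []
        for wake_cycle, remaining in waiting:
            if wake_cycle <= cycle:
                ready.append(remaining)
            else:
                still_waiting.append((wake_cycle, remaining))
        waiting = still_waiting
        # Start new filaments while the queues have room.
        while not_started > 0 and len(ready) + len(waiting) < max_in_flight:
            ready.append(accesses_per_filament)
            not_started -= 1
        # The data path handles at most one ready filament per cycle.
        if ready:
            remaining = ready.popleft()
            if remaining == 0:
                done += 1  # filament terminates
            else:
                # Issue a memory request and preempt the filament.
                waiting.append((cycle + latency, remaining - 1))
        cycle += 1
    return cycle


for in_flight in (1, 10, 100, 200):
    print(in_flight, simulate_filaments(1000, in_flight, latency=100))
```

With a single filament in flight, every access exposes the full latency; once the number of in-flight filaments exceeds the latency in cycles, the data path stays busy and total time is bounded by issue slots rather than by memory latency.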
filament data path example
A[i] += B[i] * C[D[i]]; % typical of sparse linear algebra
DATA PATH stages:
   generate i
   fetch D[i]      (returns D[i])
   fetch C[D[i]]   (returns C[D[i]])
   fetch B[i]      (returns B[i])
   ∑
   write A[i]
[Figure: each fetch goes to MEMORY; the filament sits in the wait queue until the value returns]
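The staged data path for `A[i] += B[i] * C[D[i]]` can be mimicked in software with one generator per filament, where each `yield` stands for a memory fetch that parks the filament in the wait queue. This is a hedged sketch: the generator-based modeling, the function names, and the 3-cycle latency are assumptions, and the explicit read of `A[i]` is added here because `+=` needs the old value (the slide shows only the three fetches, ∑, and the write).

```python
from collections import deque


def filament(i, A, B, C, D):
    """One filament computing A[i] += B[i] * C[D[i]]; each yield is a memory fetch."""
    d = yield (D, i)   # fetch D[i]
    c = yield (C, d)   # fetch C[D[i]]
    b = yield (B, i)   # fetch B[i]
    a = yield (A, i)   # fetch old A[i] (assumed step, needed for +=)
    A[i] = a + b * c   # write A[i]; the filament then terminates


def run_filaments(A, B, C, D, latency=3):
    """Interleave one filament per index i; return the number of cycles used."""
    ready = deque((filament(i, A, B, C, D), None) for i in range(len(A)))
    waiting = deque()  # (wake_cycle, filament, fetched_value), FIFO by wake time
    cycle = 0
    while ready or waiting:
        # Memory responses move waiting filaments back to the ready queue.
        while waiting and waiting[0][0] <= cycle:
            _, f, value = waiting.popleft()
            ready.append((f, value))
        if ready:
            f, value = ready.popleft()
            try:
                array, index = f.send(value)  # run until the next fetch
                waiting.append((cycle + latency, f, array[index]))
            except StopIteration:
                pass                          # filament termination
        cycle += 1
    return cycle


A = [1.0, 2.0, 3.0]
B = [2.0, 3.0, 4.0]
C = [10.0, 20.0, 30.0]
D = [2, 0, 1]
run_filaments(A, B, C, D)
print(A)  # [61.0, 32.0, 83.0]
```

Because every filament writes a distinct `A[i]`, the interleaving order does not affect the result; the indirection through `D[i]` is what makes the access pattern irregular and cache-unfriendly.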