Considerations in the Design of Hydra: a Multiprocessor-On-A-Chip Microarchitecture

Lance Hammond and Kunle Olukotun
Technical Report No.: CSL-TR-98-749
February 1998

Computer Systems Laboratory
Departments of Electrical Engineering and Computer Science
Stanford University
Gates Computer Science Building, #408
Stanford, CA 94305-9040
[email protected]

This work was supported by DARPA contract DABT63-95-C-0089.

Abstract

As more transistors are integrated onto larger dies, single-chip multiprocessors integrated with large amounts of cache memory will soon become a feasible alternative to the large, monolithic uniprocessors that dominate today's microprocessor marketplace. Hydra offers a promising way to build a small-scale MP-on-a-chip using a fairly simple design that still maintains excellent performance on a wide variety of applications. This report examines key parts of the Hydra design (the memory hierarchy, the on-chip buses, and the control and arbitration mechanisms) and explains the rationale for some of the decisions made in the course of finalizing the design of this memory system, with particular emphasis given to applications that stress the memory system with numerous memory accesses. With the balance between complexity and performance that we obtain, we feel Hydra offers a promising model for future MP-on-a-chip designs.

Keywords & Phrases: microprocessors, multiprocessors, MP-on-a-chip (CMP), cache memories, cache coherence, SimOS, SUIF

Copyright © 1997, 1998.
Lance Hammond and Kunle Olukotun

Computer Systems Laboratory, Stanford University, Stanford, CA 94305-4070
email: [email protected], [email protected], http://www-hydra.stanford.edu

1. Introduction

The Hydra microarchitecture is a research vehicle currently being designed at Stanford in an effort to evaluate the concept of a multiprocessor on a chip as an alternative for future microprocessor development, when large numbers of transistors and RAM may be integrated on a single chip. We have previously demonstrated the value of this approach in a generalized processor [1]. This technical report provides a more detailed view of the system we are currently designing.

Hydra is composed of four 2-way superscalar MIPS CPUs on a single chip, each similar to a small R10000 processor [2] with individual L1 instruction and data caches attached. A single, unified L2 cache supports the individual L1 caches and provides a path for rapid and simple communication between the processors. These two levels of the memory hierarchy and the bus interconnections between them are the focus of the Hydra design effort described here. However, the design would be incomplete without a state-of-the-art off-chip memory system. On one side of the chip, a 128-bit L3 cache bus attaches to an array of high-speed cache SRAMs, while on the other side Rambus channels directly connect main memory to Hydra and a more conventional bus connects Hydra to I/O devices. Fig. 1 shows a diagram of the architecture, while fig. 2 depicts a possible layout for the completed design.

This paper gives a brief overview of the microarchitecture and attempts to describe some of the trade-offs that have been evaluated in the course of revising the design. Section 2 gives a brief overview of the simulation environment we are using to evaluate Hydra, and presents a few of our most interesting results obtained through simulation. Section 3 presents a descriptive overview of Hydra's memory hierarchy, along with a qualitative view of how the different levels interact. The communication buses used to transmit data between the different parts of memory are described in section 4. Control mechanisms, including the resource and address arbiters, are described briefly in section 5. Finally, section 6 concludes.

2. Simulation Methodology and Results

Hydra is currently being evaluated using a sophisticated, cycle-accurate memory simulator written in C++ that is interfaced with the SimOS machine simulator [7]. SimOS allows us to simulate four fully functional MIPS-II ISA CPUs and a suite of I/O devices with enough realism to boot and execute the IRIX 5.3 operating system under our tested applications. As a result, system calls and I/O in our benchmarks were simulated with exceptional realism. The Hydra simulator models the memory system using a group of interconnected state machine controllers that evaluate the memory's response to all memory references, both instruction and data, supplied by SimOS. These controllers communicate through shared models of the central arbitration mechanisms and the caches in order to accurately model the time required to complete all accesses.

This paper focuses on describing the Hydra hardware qualitatively, but some key numbers from two applications that we evaluated are used throughout the text to illustrate the rationale for key design features of Hydra. The numbers of interest are plotted in figs. 3–6. Representative samples from the core loops of the swim and tomcatv SPEC95FP benchmarks, parallelized automatically using the SUIF compiler system [8], were executed on the simulator to get these results. While we have examined several other applications from the SPEC suite, these two have exhibited the most interesting memory system behavior because they stress the memory system with large numbers of accesses. In contrast, the Hydra memory system's cache hierarchy easily handles the small data sets of the other SPEC benchmarks. In the future, we may examine other applications with large data sets, such as databases.

Figure 1: A schematic overview of Hydra. [Block diagram: centralized bus arbitration mechanisms; CPUs 0 to 3, each with an L1 instruction cache, an L1 data cache, and a memory controller (cache refills in, writes out, invalidates in); a 64-bit write-through bus and a 256-bit read/replace bus; the on-chip L2 cache, the off-chip L3 interface, the Rambus memory interface, and the I/O bus interface with direct I/O DMA. Beyond the edge of the chip: the cache SRAM array, DRAM main memory, and I/O devices.]

Figure 2: A possible layout of Hydra. [Floorplan: CPUs #0 to #3, each with an FPU, a PMMU, an I-cache, a D-cache, and a memory control block; L2 cache data and tags; the address jammer and central arbiter; the memory bus wiring channel; the L2/L3 cache controller; the main memory RamBus interfaces and controller; L3 data ports "A" and "B" and the L3 tag and address port; the I/O system bus control unit; clock and test ports.]

Figure 3: Resource occupancies in Hydra for several simulated situations. (see section 3 for the L2/L3 cache discussion and section 4 for discussion of the buses) [Bar chart, swim and tomcatv; y-axis: % of time occupied; groups: L2 ports, L3 associativity, # of buses.]

Figure 4: The % increase in execution time seen as a result of several architectural decisions, over "perfect" scenarios. (see section 3 for the L2/L3 cache discussion and section 4 for discussion of the buses) [Bar chart, swim and tomcatv; y-axis: % slowdown from "perfect" case; groups: L2 ports, L3 associativity, # of buses.]

Figure 5: The % increase in execution time, over "perfect" scenarios with no arbitration at all between fully pipelined state machines, seen as a result of variations in the arbitration mechanisms and pipelining of the read and write control state machines. (see section 5 for discussion) [Bar chart, swim and tomcatv; y-axis: % slowdown from "perfect" case; cases: unpipelined / old arbiter, pipelined / old arbiter, pipelined / new arbiter, address jamming only.]

Figure 6: The % increase in execution time seen as a result of variations in the degree of pipelining used in the read and write control state machines. (see section 5 for discussion) [Bar chart, swim and tomcatv; y-axis: % slowdown from "All Pipelined" case; cases: pipelined L2 only, pipelined L2 & L3, all pipelined, roughly pipelined.]

| | L1 Cache | L2 Cache | L3 Cache | Main Memory |
|---|---|---|---|---|
| Configuration | Separate I & D SRAM cache pairs for each CPU | Shared, on-chip SRAM cache | Shared, off-chip SRAM cache | Off-chip DRAM |
| Capacity | 16KB each | 512KB | 8MB | 128MB |
| Bus Width | 64-bit connection to CPU | 256-bit read bus + 64-bit write bus* | 128-bit synchronous SRAM (running at half the CPU speed) | 32-bit Rambus (running at the full CPU speed) |
| Access Time | 1 CPU cycle | 5 CPU cycles | 10 cycles to first word | at least 50 cycles |
| Associativity | 4-way | 4-way | 2-way* | N/A |
| Line Size | 32 bytes | 32 bytes | 64 bytes | 4KB pages |
| Write Policy | Writethrough, no allocate on write | Writeback, allocate on writes | Writeback | "writeback" (virtual memory paging) |
| Inclusion | N/A | Inclusion not enforced by L2 on L1 caches | Includes all data in L1 and/or L2 caches | Includes all cached data |

Table 1: A summary of the cache hierarchy. The two entries marked with asterisks, L2 cache bus width and L3 cache associativity, were varied during the experiments documented in figures 3 and 4. A third experiment, comparing a single L2 port against a pair of separate read and write ports, is also documented in the graphs but is not reflected in this table, since all other levels of memory are strictly single-ported.

3. The Memory Hierarchy

[…] fairly sophisticated arbitration logic between the L1 cache and the processors.
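The parameters in Table 1 imply a concrete geometry for each cache level, which can be worked out with the standard set-associative relation (sets = capacity / (line size × associativity)). The sketch below is only illustrative: the `CacheLevel` class, its field names, and the `bus_beats` helper are hypothetical and not part of the Hydra simulator, the capacities and line sizes are the values read from Table 1, and the bus-beat count is an idealization that ignores arbitration and protocol overhead.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class CacheLevel:
    """Hypothetical container for one cache level's Table 1 parameters."""
    name: str
    capacity_bytes: int
    line_bytes: int
    ways: int

    @property
    def num_lines(self) -> int:
        # Total number of cache lines the level can hold.
        return self.capacity_bytes // self.line_bytes

    @property
    def num_sets(self) -> int:
        # Standard set-associative relation: sets = lines / ways.
        return self.num_lines // self.ways


def bus_beats(line_bytes: int, bus_bits: int) -> int:
    """Bus transfers needed to move one full line (ceiling division).

    Idealized: ignores arbitration, address cycles, and protocol overhead.
    """
    return -(-(line_bytes * 8) // bus_bits)


# Values as read from Table 1 (each L1 instruction or data cache is 16KB).
L1 = CacheLevel("L1", 16 * KB, 32, 4)
L2 = CacheLevel("L2", 512 * KB, 32, 4)
L3 = CacheLevel("L3", 8 * MB, 64, 2)

if __name__ == "__main__":
    for level in (L1, L2, L3):
        print(f"{level.name}: {level.num_lines} lines, {level.num_sets} sets")
    # A 32-byte line on the 256-bit read bus, a 64-byte line on the 128-bit L3 bus.
    print("L2 line over 256-bit read bus:", bus_beats(L2.line_bytes, 256), "beat(s)")
    print("L3 line over 128-bit L3 bus:", bus_beats(L3.line_bytes, 128), "beat(s)")
```

Under these assumptions a full 32-byte line fits in a single transfer on the 256-bit read bus, while a 64-byte L3 line needs four transfers on the 128-bit L3 bus, which Table 1 lists as running at half the CPU speed.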