CAREER & TECHNOLOGY EDUCATION Animal Science
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Lecture 7: Synchronous Sequential Logic
Systems I: Computer Organization and Architecture Lecture 7: Synchronous Sequential Logic Synchronous Sequential Logic • The digital circuits that we have looked at so far are combinational, depending only on the inputs. In practice, most systems have many components that contain memory elements, which require sequential logic. • A sequential circuit contains both a combinational circuit and memory elements, whose output also serves as an input for the combinational circuit. • The binary data stored in the memory elements define the state of the sequential circuit and help determine the conditions for changing the state in the memory elements. Inputs Combinational Outputs circuit Memory elements Sequential Circuits: Synchronous and Asynchronous • There are two types of sequential circuits, classified by their signal’s timing: – Synchronous sequential circuits have behavior that is defined from knowledge of its signal at discrete instants of time. – The behavior of asynchronous sequential circuits depends on the order in which input signals change and are affected at any instant of time. Synchronous Sequential Circuits • Synchronous sequential circuits need to use signal that affect memory elements at discrete time instants. • Synchronization is achieved by a timing device called a master-clock generator which generates a periodic train of clock pulses. Basic Flip-Flop Circuit Using NOR Gates 1 R(reset) Q 0 Set state Clear state 1 Q’ 0 S(set) S R Q Q’ 1 0 1 0 0 0 1 0 (after S = 1, R = 0) 0 1 0 1 0 0 0 1 (after S = 0, R = 1) 1 1 0 0 Basic -
2.5 Classification of Parallel Computers
52 // Architectures 2.5 Classification of Parallel Computers 2.5 Classification of Parallel Computers 2.5.1 Granularity In parallel computing, granularity means the amount of computation in relation to communication or synchronisation Periods of computation are typically separated from periods of communication by synchronization events. • fine level (same operations with different data) ◦ vector processors ◦ instruction level parallelism ◦ fine-grain parallelism: – Relatively small amounts of computational work are done between communication events – Low computation to communication ratio – Facilitates load balancing 53 // Architectures 2.5 Classification of Parallel Computers – Implies high communication overhead and less opportunity for per- formance enhancement – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation. • operation level (different operations simultaneously) • problem level (independent subtasks) ◦ coarse-grain parallelism: – Relatively large amounts of computational work are done between communication/synchronization events – High computation to communication ratio – Implies more opportunity for performance increase – Harder to load balance efficiently 54 // Architectures 2.5 Classification of Parallel Computers 2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1) In N elements in pipeline and for 8 element L clock cycles =) for calculation it would take L + N cycles; without pipeline L ∗ N cycles Example of good code for pipelineing: §doi =1 ,k ¤ z ( i ) =x ( i ) +y ( i ) end do ¦ 55 // Architectures 2.5 Classification of Parallel Computers Vector processors, fast vector operations (operations on arrays). Previous example good also for vector processor (vector addition) , but, e.g. recursion – hard to optimise for vector processors Example: IntelMMX – simple vector processor. -
Computer Hardware Architecture Lecture 4
Computer Hardware Architecture Lecture 4 Manfred Liebmann Technische Universit¨atM¨unchen Chair of Optimal Control Center for Mathematical Sciences, M17 [email protected] November 10, 2015 Manfred Liebmann November 10, 2015 Reading List • Pacheco - An Introduction to Parallel Programming (Chapter 1 - 2) { Introduction to computer hardware architecture from the parallel programming angle • Hennessy-Patterson - Computer Architecture - A Quantitative Approach { Reference book for computer hardware architecture All books are available on the Moodle platform! Computer Hardware Architecture 1 Manfred Liebmann November 10, 2015 UMA Architecture Figure 1: A uniform memory access (UMA) multicore system Access times to main memory is the same for all cores in the system! Computer Hardware Architecture 2 Manfred Liebmann November 10, 2015 NUMA Architecture Figure 2: A nonuniform memory access (UMA) multicore system Access times to main memory differs form core to core depending on the proximity of the main memory. This architecture is often used in dual and quad socket servers, due to improved memory bandwidth. Computer Hardware Architecture 3 Manfred Liebmann November 10, 2015 Cache Coherence Figure 3: A shared memory system with two cores and two caches What happens if the same data element z1 is manipulated in two different caches? The hardware enforces cache coherence, i.e. consistency between the caches. Expensive! Computer Hardware Architecture 4 Manfred Liebmann November 10, 2015 False Sharing The cache coherence protocol works on the granularity of a cache line. If two threads manipulate different element within a single cache line, the cache coherency protocol is activated to ensure consistency, even if every thread is only manipulating its own data. -
Threading SIMD and MIMD in the Multicore Context the Ultrasparc T2
Overview SIMD and MIMD in the Multicore Context Single Instruction Multiple Instruction ● (note: Tute 02 this Weds - handouts) ● Flynn’s Taxonomy Single Data SISD MISD ● multicore architecture concepts Multiple Data SIMD MIMD ● for SIMD, the control unit and processor state (registers) can be shared ■ hardware threading ■ SIMD vs MIMD in the multicore context ● however, SIMD is limited to data parallelism (through multiple ALUs) ■ ● T2: design features for multicore algorithms need a regular structure, e.g. dense linear algebra, graphics ■ SSE2, Altivec, Cell SPE (128-bit registers); e.g. 4×32-bit add ■ system on a chip Rx: x x x x ■ 3 2 1 0 execution: (in-order) pipeline, instruction latency + ■ thread scheduling Ry: y3 y2 y1 y0 ■ caches: associativity, coherence, prefetch = ■ memory system: crossbar, memory controller Rz: z3 z2 z1 z0 (zi = xi + yi) ■ intermission ■ design requires massive effort; requires support from a commodity environment ■ speculation; power savings ■ massive parallelism (e.g. nVidia GPGPU) but memory is still a bottleneck ■ OpenSPARC ● multicore (CMT) is MIMD; hardware threading can be regarded as MIMD ● T2 performance (why the T2 is designed as it is) ■ higher hardware costs also includes larger shared resources (caches, TLBs) ● the Rock processor (slides by Andrew Over; ref: Tremblay, IEEE Micro 2009 ) needed ⇒ less parallelism than for SIMD COMP8320 Lecture 2: Multicore Architecture and the T2 2011 ◭◭◭ • ◮◮◮ × 1 COMP8320 Lecture 2: Multicore Architecture and the T2 2011 ◭◭◭ • ◮◮◮ × 3 Hardware (Multi)threading The UltraSPARC T2: System on a Chip ● recall concurrent execution on a single CPU: switch between threads (or ● OpenSparc Slide Cast Ch 5: p79–81,89 processes) requires the saving (in memory) of thread state (register values) ● aggressively multicore: 8 cores, each with 8-way hardware threading (64 virtual ■ motivation: utilize CPU better when thread stalled for I/O (6300 Lect O1, p9–10) CPUs) ■ what are the costs? do the same for smaller stalls? (e.g. -
Computer Architecture: Parallel Processing Basics
Computer Architecture: Parallel Processing Basics Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/9/13 Today What is Parallel Processing? Why? Kinds of Parallel Processing Multiprocessing and Multithreading Measuring success Speedup Amdhal’s Law Bottlenecks to parallelism 2 Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Geographically Distributed Power Internet Grid Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Geographically Distributed Power Internet Grid Cloud Computing EC2 Tashi PDL'09 © 2007-9 Goldstein5 Concurrent Systems Embedded-Physical Distributed Sensor Claytronics Networks Geographically Distributed Power Internet Grid Cloud Computing EC2 Tashi Parallel PDL'09 © 2007-9 Goldstein6 Concurrent Systems Physical Geographical Cloud Parallel Geophysical +++ ++ --- --- location Relative +++ +++ + - location Faults ++++ +++ ++++ -- Number of +++ +++ + - Processors + Network varies varies fixed fixed structure Network --- --- + + connectivity 7 Concurrent System Challenge: Programming The old joke: How long does it take to write a parallel program? One Graduate Student Year 8 Parallel Programming Again?? Increased demand (multicore) Increased scale (cloud) Improved compute/communicate Change in Application focus Irregular Recursive data structures PDL'09 © 2007-9 Goldstein9 Why Parallel Computers? Parallelism: Doing multiple things at a time Things: instructions, -
Parallel Processing! 1! CSE 30321 – Lecture 23 – Introduction to Parallel Processing! 2! Suggested Readings! •! Readings! –! H&P: Chapter 7! •! (Over Next 2 Weeks)!
CSE 30321 – Lecture 23 – Introduction to Parallel Processing! 1! CSE 30321 – Lecture 23 – Introduction to Parallel Processing! 2! Suggested Readings! •! Readings! –! H&P: Chapter 7! •! (Over next 2 weeks)! Lecture 23" Introduction to Parallel Processing! University of Notre Dame! University of Notre Dame! CSE 30321 – Lecture 23 – Introduction to Parallel Processing! 3! CSE 30321 – Lecture 23 – Introduction to Parallel Processing! 4! Processor components! Multicore processors and programming! Processor comparison! vs.! Goal: Explain and articulate why modern microprocessors now have more than one core andCSE how software 30321 must! adapt to accommodate the now prevalent multi- core approach to computing. " Introduction and Overview! Writing more ! efficient code! The right HW for the HLL code translation! right application! University of Notre Dame! University of Notre Dame! CSE 30321 – Lecture 23 – Introduction to Parallel Processing! CSE 30321 – Lecture 23 – Introduction to Parallel Processing! 6! Pipelining and “Parallelism”! ! Load! Mem! Reg! DM! Reg! ALU ! Instruction 1! Mem! Reg! DM! Reg! ALU ! Instruction 2! Mem! Reg! DM! Reg! ALU ! Instruction 3! Mem! Reg! DM! Reg! ALU ! Instruction 4! Mem! Reg! DM! Reg! ALU Time! Instructions execution overlaps (psuedo-parallel)" but instructions in program issued sequentially." University of Notre Dame! University of Notre Dame! CSE 30321 – Lecture 23 – Introduction to Parallel Processing! CSE 30321 – Lecture 23 – Introduction to Parallel Processing! Multiprocessing (Parallel) Machines! Flynn#s -
Performance Evaluation of a Signal Processing Algorithm with General-Purpose Computing on a Graphics Processing Unit
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS STOCKHOLM, SWEDEN 2019 Performance Evaluation of a Signal Processing Algorithm with General-Purpose Computing on a Graphics Processing Unit FILIP APPELGREN MÅNS EKELUND KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE Performance Evaluation of a Signal Processing Algorithm with General-Purpose Computing on a Graphics Processing Unit Filip Appelgren and M˚ansEkelund June 5, 2019 2 Abstract Graphics Processing Units (GPU) are increasingly being used for general-purpose programming, instead of their traditional graphical tasks. This is because of their raw computational power, which in some cases give them an advantage over the traditionally used Central Processing Unit (CPU). This thesis therefore sets out to identify the performance of a GPU in a correlation algorithm, and what parameters have the greatest effect on GPU performance. The method used for determining performance was quantitative, utilizing a clock library in C++ to measure performance of the algorithm as problem size increased. Initial problem size was set to 28 and increased exponentially to 221. The results show that smaller sample sizes perform better on the serial CPU implementation but that the parallel GPU implementations start outperforming the CPU between problem sizes of 29 and 210. It became apparent that GPU's benefit from larger problem sizes, mainly because of the memory overhead costs involved with allocating and transferring data. Further, the algorithm that is under evaluation is not suited for a parallelized implementation due to a high amount of branching. Logic can lead to warp divergence, which can drastically lower performance. -
Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Copyright © 2012, Elsevier Inc. All rights reserved. 1 Contents 1. SIMD architecture 2. Vector architectures optimizations: Multiple Lanes, Vector Length Registers, Vector Mask Registers, Memory Banks, Stride, Scatter-Gather, 3. Programming Vector Architectures 4. SIMD extensions for media apps 5. GPUs – Graphical Processing Units 6. Fermi architecture innovations 7. Examples of loop-level parallelism 8. Fallacies Copyright © 2012, Elsevier Inc. All rights reserved. 2 Classes of Computers Classes Flynn’s Taxonomy SISD - Single instruction stream, single data stream SIMD - Single instruction stream, multiple data streams New: SIMT – Single Instruction Multiple Threads (for GPUs) MISD - Multiple instruction streams, single data stream No commercial implementation MIMD - Multiple instruction streams, multiple data streams Tightly-coupled MIMD Loosely-coupled MIMD Copyright © 2012, Elsevier Inc. All rights reserved. 3 Introduction Advantages of SIMD architectures 1. Can exploit significant data-level parallelism for: 1. matrix-oriented scientific computing 2. media-oriented image and sound processors 2. More energy efficient than MIMD 1. Only needs to fetch one instruction per multiple data operations, rather than one instr. per data op. 2. Makes SIMD attractive for personal mobile devices 3. Allows programmers to continue thinking sequentially SIMD/MIMD comparison. Potential speedup for SIMD twice that from MIMID! x86 processors expect two additional cores per chip per year SIMD width to double every four years Copyright © 2012, Elsevier Inc. All rights reserved. 4 Introduction SIMD parallelism SIMD architectures A. Vector architectures B. SIMD extensions for mobile systems and multimedia applications C. -
An Introduction to CUDA/Opencl and Graphics Processors
An Introduction to CUDA/OpenCL and Graphics Processors Bryan Catanzaro, NVIDIA Research Overview ¡ Terminology ¡ The CUDA and OpenCL programming models ¡ Understanding how CUDA maps to NVIDIA GPUs ¡ Thrust 2/74 Heterogeneous Parallel Computing Latency Throughput Optimized CPU Optimized GPU Fast Serial Scalable Parallel Processing Processing 3/74 Latency vs. Throughput Latency Throughput ¡ Latency: yoke of oxen § Each core optimized for executing a single thread ¡ Throughput: flock of chickens § Cores optimized for aggregate throughput, deemphasizing individual performance ¡ (apologies to Seymour Cray) 4/74 Latency vs. Throughput, cont. Specificaons Sandy Bridge- Kepler EP (Tesla K20) 8 cores, 2 issue, 14 SMs, 6 issue, 32 Processing Elements 8 way SIMD way SIMD @3.1 GHz @730 MHz 8 cores, 2 threads, 8 14 SMs, 64 SIMD Resident Strands/ way SIMD: vectors, 32 way Threads (max) SIMD: Sandy Bridge-EP (32nm) 96 strands 28672 threads SP GFLOP/s 396 3924 Memory Bandwidth 51 GB/s 250 GB/s Register File 128 kB (?) 3.5 MB Local Store/L1 Cache 256 kB 896 kB L2 Cache 2 MB 1.5 MB L3 Cache 20 MB - Kepler (28nm) 5/74 Why Heterogeneity? ¡ Different goals produce different designs § Manycore assumes work load is highly parallel § Multicore must be good at everything, parallel or not ¡ Multicore: minimize latency experienced by 1 thread § lots of big on-chip caches § extremely sophisticated control ¡ Manycore: maximize throughput of all threads § lots of big ALUs § multithreading can hide latency … so skip the big caches § simpler control, cost amortized over -
Lecture 34: Bonus Topics
Lecture 34: Bonus topics Philipp Koehn, David Hovemeyer December 7, 2020 601.229 Computer Systems Fundamentals Outline I GPU programming I Virtualization and containers I Digital circuits I Compilers Code examples on web page: bonus.zip GPU programming 3D graphics Rendering 3D graphics requires significant computation: I Geometry: determine visible surfaces based on geometry of 3D shapes and position of camera I Rasterization: determine pixel colors based on surface, texture, lighting A GPU is a specialized processor for doing these computations fast GPU computation: use the GPU for general-purpose computation Streaming multiprocessor I Fetches instruction (I-Cache) I Has to apply it over a vector of data I Each vector element is processed in one thread (MT Issue) I Thread is handled by scalar processor (SP) I Special function units (SFU) Flynn’s taxonomy I SISD (single instruction, single data) I uni-processors (most CPUs until 1990s) I MIMD (multi instruction, multiple data) I all modern CPUs I multiple cores on a chip I each core runs instructions that operate on their own data I SIMD (single instruction, multiple data) I Streaming Multi-Processors (e.g., GPUs) I multiple cores on a chip I same instruction executed on different data GPU architecture GPU programming I If you have an application where I Data is regular (e.g., arrays) I Computation is regular (e.g., same computation is performed on many array elements) then doing the computation on the GPU is likely to be much faster than doing the computation on the CPU I Issues: I GPU -
SEQUENTIAL LOGIC Combinational: Output Depends Only on Current Inputs Sequential: Output Depend on Current Inputs Plus Past History Includes Memory Elements
10/10/2019 COMBINATIONAL VS. SEQUENTIAL DIGITAL ELECTRONICS LOGIC SYSTEM DESIGN Combinational Circuit Sequential Circuit inputs Combinational outputs inputs Combinational outputs FALL 2019 Logic Logic PROF. IRIS BAHAR (TAUGHT BY JIWON CHOE) OCTOBER 9, 2019 State LECTURE 11: SEQUENTIAL LOGIC Combinational: Output depends only on current inputs Sequential: Output depend on current inputs plus past history Includes memory elements BISTABLE MEMORY STORAGE SEQUENTIAL CIRCUITS ELEMENT Fundamental building block of other state elements Two outputs, Q, Q Outputs depend on inputs and state variables No inputs I1 Q The state variables embody the past Q Q I2 I1 Storage elements hold the state variables A clock periodically advances the circuit I2 Q What does the circuit do? 1 10/10/2019 BISTABLE MEMORY ELEMENT REVISIT NOR & NAND GATES Controlling inputs for NAND and NOR gates 0 • Consider the two possible cases: I1 Q 1 – Q = 0: then Q’ = 1 and Q = 0 (consistent) X X 0 1 1 0 I2 Q 0 1 1 – Q = 1: then Q’ = 0 and Q = 1 (consistent) I1 Q 0 Implementing NOT with NAND/NOR using non-controlling 1 0 I2 Q inputs – Bistable circuit stores 1 bit of state in the X X X’ X’ state variable, Q (or Q’ ) 1 0 • But there are no inputs to control the state S-R (SET/RESET) LATCH S-R LATCH ANALYSIS 0 R N1 Q – S = 1, R = 0: then Q = 1 and Q = 0 . Consider the four possible cases: 1 N2 Q . S = 1, R = 0 S . S = 0, R = 1 R N1 Q 1 R . -
CPE 323 Introduction to Embedded Computer Systems: Introduction
CPE 323 Introduction to Embedded Computer Systems: Introduction Instructor: Dr Aleksandar Milenkovic CPE 323 Administration Syllabus textbook & other references grading policy important dates course outline Prerequisites Number representation Digital design: combinational and sequential logic Computer systems: organization Embedded Systems Laboratory Located in EB 106 EB 106 Policies Introduction sessions Lab instructor CPE 323: Introduction to Embedded Computer Systems 2 CPE 323 Administration LAB Session on-line LAB manuals and tutorials Access cards Accounts Lab Assistant: Zahra Atashi Lab sessions (select 4 from the following list) Monday 8:00 - 9:30 AM Wednesday 8:00 - 9:30 AM Wednesday 5:30 - 7:00 PM Friday 8:00 - 9:30 AM Friday 9:30 – 11:00 AM Sign-up sheet will be available in the laboratory CPE 323: Introduction to Embedded Computer Systems 3 Outline Computer Engineering: Past, Present, Future Embedded systems What are they? Where do we find them? Structure and Organization Software Architectures CPE 323: Introduction to Embedded Computer Systems 4 What Is Computer Engineering? The creative application of engineering principles and methods to the design and development of hardware and software systems Discipline that combines elements of both electrical engineering and computer science Computer engineers are electrical engineers that have additional training in the areas of software design and hardware-software integration CPE 323: Introduction to Embedded Computer Systems 5 What Do Computer Engineers Do? Computer engineers are involved in all aspects of computing Design of computing devices (both Hardware and Software) Where are computing devices? Embedded computer systems (low-end – high-end) In: cars, aircrafts, home appliances, missiles, medical devices,..