CSCI 4717/5717 Computer Architecture Cache Coherent NUMA

Nonuniform Memory Access (NUMA) CSCI 4717/5717 Computer Architecture • Alternative to SMP & clustering • Uniform memory access – All processors have access to all parts of memory using load & store – Access time to all regions of memory is the same Topic: Nonuniform Memory Access – Access time to memory for different processors same and Vector Computation – As used by SMP • Nonuniform memory access Reading: Stallings, Sections 18.5 through 18.6 – All processors have access to all parts of memory using load & store – Access time of processor differs depending on region of memory – Different processors access different regions of memory at different speeds CSCI 4717 – Computer Architecture Parallel Processing – Page 1 CSCI 4717 – Computer Architecture Parallel Processing – Page 2 Cache coherent NUMA (CC-NUMA) NUMA Motivation • Cache coherence is maintained among the • SMP has practical limit to number of caches of the various processors processors – Bus traffic limits to between 16 • Significantly different from SMP and clusters and 64 processors • In clusters each node has own memory – Apps do not see large global memory – Coherence maintained by software not hardware • Software must be tailored to take advantage of cluster characteristics CSCI 4717 – Computer Architecture Parallel Processing – Page 3 CSCI 4717 – Computer Architecture Parallel Processing – Page 4 NUMA Motivation (continued) CC-NUMA Organization • NUMA retains SMP flavour while giving • Looks like SMPs used as nodes of a cluster large scale multiprocessing • Each processor has own L1 and L2 cache • Objective is to maintain transparent system • Each node has own main memory wide memory while permitting • Nodes connected by some networking facility multiprocessor nodes, each with own bus or • Each processor sees single addressable memory internal interconnection system space CSCI 4717 – Computer Architecture Parallel Processing – Page 5 CSCI 4717 – Computer Architecture Parallel Processing – Page 6 1 CC-NUMA Operation Memory Access Sequence • Memory request order: • Each node maintains directory of location of portions of – L1 cache (local to processor) memory and cache status – L2 cache (local to processor) • Example: node 2 processor 3 (P2-3) requests location 798 which is in memory of node 1 – Main memory (local to node) – P2-3 issues read request on snoopy bus of node 2 – Remote memory -- delivered to requesting (local to – Directory on node 2 recognizes location is on node 1 processor) cache – Node 2 directory requests node 1’s directory • Automatic and transparent -- this is difference with – Node 1 directory requests contents of 798 clusters – Node 1 memory puts data on (node 1 local) bus – Node 1 directory gets data from (node 1 local) bus – Data transferred to node 2’s directory – Node 2 directory puts data on (node 2 local) bus – Data picked up, put in P2-3’s cache and delivered to processor CSCI 4717 – Computer Architecture Parallel Processing – Page 7 CSCI 4717 – Computer Architecture Parallel Processing – Page 8 Cache Coherence NUMA Pros & Cons • Node 1 directory keeps note that node 2 has copy • Effective performance at higher levels of of data parallelism than SMP • If data modified in cache, this is broadcast to other • No major software changes nodes • Performance can breakdown if too much access to • Local directories monitor and purge local cache if remote memory – Can be avoided by necessary – L1 & L2 cache design reducing all memory access -- • Local directory monitors changes to local data in Need good temporal locality of software remote caches and marks memory invalid until writeback – Good spatial locality of software – Virtual memory management moving pages to nodes • Local directory forces writeback if memory location that are using them most requested by another processor CSCI 4717 – Computer Architecture Parallel Processing – Page 9 CSCI 4717 – Computer Architecture Parallel Processing – Page 10 NUMA Pros & Cons (continued) Vector Computation • Not transparent – O/S needs to address • Maths problems involving physical changes in use of processes present different difficulties for – page allocation computation – process allocation – Aerodynamics, seismology, meteorology – load balancing – Continuous field simulation • Availability -- Failure of a node is not easily – High precision – Repeated floating point calculations on large arrays of recovered from numbers CSCI 4717 – Computer Architecture Parallel Processing – Page 11 CSCI 4717 – Computer Architecture Parallel Processing – Page 12 2 Vector Computation (continued) Approaches • General purpose computers rely on iteration to do • Supercomputers handle these types of problem vector calculations – Hundreds of millions of flops • This example needs six calculations – $10-15 million – Optimised for calculation rather than multitasking and I/O – Limited market – Research, government agencies, meteorology • Array processor – Alternative to supercomputer – Configured as peripherals to mainframe & mini – Just run vector portion of problems CSCI 4717 – Computer Architecture Parallel Processing – Page 13 CSCI 4717 – Computer Architecture Parallel Processing – Page 14 Vector processing Parallel processing • Assume possible to operate on one-dimensional • Independent processors functioning in vector of data parallel • All elements in a particular row can be calculated • Use FORK N to start individual process at in parallel location N • JOIN N causes N independent processes to join and merge following JOIN • O/S – Co-ordinates JOINs – Execution is blocked until all N processes have reached JOIN CSCI 4717 – Computer Architecture Parallel Processing – Page 15 CSCI 4717 – Computer Architecture Parallel Processing – Page 16 Comparison of Scalar, Vector, and Parallel Processing CSCI 4717 – Computer Architecture Parallel Processing – Page 17 3.

CSCI 4717/5717 Computer Architecture Cache Coherent NUMA

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support