Debugging Multicore & Shared-Memory Embedded Systems
Debugging Multicore & Shared-Memory Embedded Systems
Classes 249 & 269, 2007 edition
Jakob Engblom, PhD, Virtutech
[email protected]

Scope & Context of This Talk
- Multiprocessor revolution
- Programming multicore
- (In)determinism
- Error sources
- Debugging techniques
- Some material is specific to shared-memory symmetric multiprocessors and multicore designs
  - There are lots of problems particular to these
- But most concepts are general to almost any parallel application
  - The problem really lies with parallelism and concurrency, rather than with any particular design choice

Introduction & Background
Multiprocessing: what, why, and when?

The Multicore Revolution is Here!
- The imminent takeover by parallel computers with many processors has been declared before...
- This time it is for real. Why?
- More instruction-level parallelism is hard to find
  - Very complex designs are needed for small gains
  - Thread-level parallelism appears alive and well
- Clock frequency scaling is slowing drastically
  - Too much power and heat when pushing the envelope
- Signals cannot cross the chip fast enough
  - Better to design small local units with short paths
- Effective use of billions of transistors
  - Easier to reuse a basic unit many times
- Potential for very easy scaling
  - Just keep adding processors/cores for higher (peak) performance

Parallel Processing
- John Hennessy, interviewed in ACM Queue, sees the following eras of computer architecture evolution:
  1. Initial efforts and early designs. 1940s: ENIAC, Zuse, Manchester, etc.
  2. Instruction-set architecture. Mid-1960s: starting with the IBM System/360, multiple machines with the same compatible instruction set.
  3. Pipelining and instruction-level parallelism. ~1980: the "RISC revolution", and the single-core performance work since.
  4. Explicitly parallel processing. 2005:
     Mainstream going parallel, parallel going mainstream.

Multi(core) Processing History
- Multiprocessors have been around since the 1950s:
  - 1959: Burroughs D825
  - 1960: Univac LARC
  - 1965: Univac 1108A, IBM 360/65
  - 1967: CDC 6500
  - 1982: Cray X-MP
  - 1984: Transputer T414
- Multicore is more recent:
  - 1995: TI C80 video processor (RISC + 4 DSPs on a chip)
  - 1999: Sun MAJC (2)
  - 2001: IBM Power4 (2), the first non-embedded multicore in production
  - 2002: TI OMAP 5470 (ARM + DSP)
  - 2004: ARM11 MPCore (4)
  - 2005: Sun UltraSPARC T1 (8x4), AMD Athlon 64 (2), IBM Xbox 360 CPU (3x2)
  - 2006: Intel Core Duo (2), Freescale MPC8641D (2) and MPC8572 (2), IBM Cell (1x2+8)
  - 2007: Intel Core 2 Quad (4)

Multiprocessing is Here
- Multiprocessor and multicore systems are the future
- The only option for maximum performance
- Some current examples:

    Vendor     Chip            #Cores  Arch           AMP  SMP
    ARM        ARM11 MPCore    4       ARMv6          X    X
    Cavium     Octeon CN38     16      MIPS64              X
    Freescale  MPC8641D        2       PPC            X    X
    IBM        970MP           2       PPC64               X
    IBM        Cell            9       PPC64, DSP     X
    Raza       XLR 7-series    8       MIPS64              X
    PA Semi    PA6T custom     8       PPC                 X
    TI         OMAP2           3       ARM, C55, IVA  X

Manycore
- In specialized, naturally parallel domains, there are already many "manycore" chips
  - "Manycore" is an emerging term for chips with 10+ cores
  - Goal: extreme performance/power and performance/chip

    Vendor      Chip    #Cores  Arch           GOps
    Cisco       Metro   188     Tensilica      50 @ 35 W
    PicoChip    PC102   344     Heterogeneous  41.6 @ 160 MHz
    Ambric      AM2045  360     Homogeneous    60 @ 333 MHz
    ClearSpeed  CS301   64      Homogeneous    25.6 @ 2.5 W

- See Asanovic et al.,
  "The Landscape of Parallel Computing Research: A View From Berkeley", Dec 2006.

The Road Here
- One processor used to require several chips
- Then one processor could fit on one chip
- Now many processors fit on a single chip
  - This changes everything, since single processors are no longer the default

Vocabulary in the Multi Era
- Multitasking: multiple tasks running on a single computer
- Multiprocessor: multiple processors used to build a single computer system
- AMP (Asymmetric MP): each processor has local memory; tasks are statically allocated to one processor
- SMP (Shared-Memory MP): processors share memory; tasks are dynamically scheduled onto any processor
- Heterogeneous: specialization among processors, often with different instruction sets (e.g., ARM + DSP); usually an AMP design
- Homogeneous: all processors have the same instruction set and can run any task; usually an SMP design
- Multicore: more than one processor on a single chip
- CMP (Chip MultiProcessor): a shared-memory multiprocessor on a single chip
- MT (Multithreading): one processor appears as multiple threads; the threads share CPU resources and are not as powerful as multiple full processors.
  Very efficient for certain types of workloads.

Vocabulary in the Multi Era
- Cache coherency:
  - Fundamental technology for shared memory
  - Each processor in the system has its own local caches
  - Multiple copies of shared data can be present in the caches
  - To maintain correct function, the caches have to be kept coherent: when one processor changes shared data, no other cache will (eventually) contain the old data
  - You want data to be coherent across all the caches and the shared memory in a system

Future Embedded Systems
- Multicore nodes: several CPUs with private L1 caches, a shared L2 cache, RAM, and devices (timers, serial ports, network interfaces, etc.), all in one shared memory space
- Nodes with local memory, connected by a network

Programming Models

The Software Becomes the Problem
- Parallelism is required to gain performance
  - Parallel hardware is "easy" to design
  - Parallel software is (very) hard to write
- It is fundamentally hard to grasp true concurrency
  - Especially in complex software environments
- Existing software assumes a single processor
  - It might break in new and interesting ways
  - Working under a multitasking OS is no guarantee of running correctly on a multiprocessor

Process, Thread, Task
- Desktop/server model: each process runs in its own memory space, with several threads per process having access to the same memory; memory is protected between processes
- Simple RTOS model: the OS and all tasks share one memory space, with all memory accessible to all tasks
- Generic model: a number of tasks share some memory in order to implement an application
- This talk will use "task" for any software thread of control

Message-Passing & Shared-Memory
- Message-passing: local memory for each task; explicit messages for communication
- Shared-memory: all tasks can access the same memory; communication is implicit
  - The most common model presented by hardware and operating systems

Programming Parallel Machines
- Synchronize & coordinate execution
- Communicate data & status between tasks
- Ensure parallelism-safe access to shared data
- Components of the shared-memory solution:
  - All tasks see the same memory
  - Locks to protect shared-data accesses
  - Synchronization primitives to coordinate execution

Success: Classic Supercomputing
- Regular programs
- Parallelized loops + serial sections
- Data dependencies between tasks
- Very high scalability, to 1000s of processors
- FORTRAN, OpenMP, pthreads, MPI

Success: Servers
- Natural parallelism
- Irregular lengths of parallel code; dynamic task creation
- A master task coordinates; one slave task per client connection
- Scales very well, using a strong database for the common data
- C, C++, OpenMP, OS APIs, MPI, pthreads, ...

Success: Signal Processing
- "Embarrassing" natural parallelism
  - No shared data
  - No communication
  - No synchronization
- A single controller CPU
- Parallelizes to 1000s of tasks and processors
- Good fit for AMP

Programming model: POSIX Threads
- Standard API

    main() {
      ...
      pthread_t p_threads[MAX_THREADS];
      pthread_attr_t attr;
      pthread_attr_init(&attr);
      for (i = 0; i < num_threads; i++) {
        hits[i] = i;
        pthread_create(&p_threads[i], &attr,
                       compute_pi, (void *) &hits[i]);
      }
      for (i = 0; i < num_threads; i++) {
        pthread_join(p_threads[i], NULL);
        total_hits += hits[i];
      }
      ...
    }

- Explicit operations
- Strong programmer control; arbitrary work in each thread
- Create & manipulate locks, mutexes, threads, etc.

Programming model: OpenMP
- Compiler directives, with special support in the compiler
- Focus on loop-level parallel execution
- Generates calls to threading libraries
- Popular in high-end embedded systems

    #pragma omp parallel private(nthreads, tid)
    {
      tid = omp_get_thread_num();
      printf("Hello World from thread = %d\n", tid);
      if (tid == 0) {
        nthreads = omp_get_num_threads();
        printf("Number of threads: %d\n", nthreads);
      }
    }

Programming model: MPI
- Message-passing API
  - Explicit messages for communication
  - Explicit distribution of data to each

    main(int argc, char *argv[])
    {
      int npes, myrank;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &npes);
      MPI_Comm_rank(MPI_COMM_WORLD, &myrank);