How to Optimize the Scalability & Performance of a Multi-Core Operating System

Architecting a Scalable Real-Time Application on an SMP Platform

Overview

When upgrading your hardware platform to a newer and more powerful CPU with more, faster cores, you expect the application to run faster. More cores should reduce the average CPU load and therefore reduce delays. In many cases, however, the application does not run faster and the CPU load is almost the same as for the older CPU. With high-end CPUs, you may even see interferences that break determinism. Why does this happen, and what can you do about it?

The answer: build for scalability. Unless an application is architected to take advantage of a multicore environment, most RTOS applications on 1-core and 4-core IPCs will perform nearly identically (contrary to the expectation that an RTOS application should scale linearly and execute 4 times faster on a 4-core IPC than it does on a 1-core IPC). Without a scalable system, 3 of the 4 cores on the 4-core system will not be utilized. Even if the application seeks to use multiple cores, other architectural optimizations involving memory access, I/O, caching strategies, data synchronization and more must be considered for the system to truly achieve optimal scalability.

While no system delivers linear scalability, you can work to achieve each application's theoretical limit. This paper identifies the key architectural strategies that ensure the best scalability of an RTOS-based application. We will explore CPU architectures, explain why performance does not get the expected boost with newer or more powerful cores, describe how to reduce the effects of the interferences, and provide recommendations for hardware modifications to limit bottlenecks.


Introduction

This paper, written for RTOS users, addresses a system where both real-time and non-real-time applications run at the same time. To keep determinism in the real-time applications, they ideally should not share any hardware with the non-real-time applications. But at the same time, it is helpful to have memory spaces and synchronization events available to both sides.

However, it is not possible to achieve both. Either you have a dedicated real-time computer but must rely on a bus protocol to exchange data with the non-real-time applications, or you have both on the same machine but they share the CPU bus and cache. Nowadays CPU cores are much faster than memory and I/O access, so the interferences come from competition for access to these resources.

This paper will first examine the CPU architecture as it relates to caches, memory and I/O access. Then we will explain how threads interact and how program design can help improve performance with multiple cores. Finally, we will give examples of practical problems and what can be done to solve or minimize them.

Most of the technical information in this paper is based on the excellent paper What Every Programmer Should Know About Memory by Ulrich Drepper of Red Hat. We recommend reading that paper if you have the time.

There is another important thing to consider when using multiple cores. The different threads in an application usually share variables, so access to these variables has to be synchronized to ensure the values are consistent. The CPU will do this automatically if it is not handled in the code, but as the CPU does not know the whole program, it will not handle it optimally and this will create many delays. These delays are the reason that an application will not necessarily run faster on two cores than on one.


1. CPU Architecture

[Figure: Traditional UMA architecture. A quad-core CPU is connected through the Northbridge to the RAM, and the Southbridge devices (USB, PCI-e, SATA) are reached through the Northbridge.]

1.1. Traditional Architecture: UMA (Uniform Memory Access)

In this model, all the cores are connected to the same bus, called the Front Side Bus (FSB), which links them to the Northbridge of the chipset. The RAM and its memory controller are connected to this same Northbridge. All other hardware is connected to the Southbridge, which connects to the CPU through the Northbridge.

CPU frequencies have increased steadily for many years without the CPUs becoming more expensive, but the same has not been true for memory. Persistent storage access (such as hard drives) is very slow, so RAM was introduced to allow the CPU to execute code and access data without having to wait for hard drive accesses.

From this design we can see that the Northbridge, Southbridge and RAM are resources shared by all the cores, and therefore by both real-time and non-real-time applications. Additionally, RAM controllers only have one port, which means only one core can access the RAM at a time.

Very fast Static RAM is available, but as it is extremely expensive it can only be used in small amounts in standard hardware (a few MB). What we commonly call RAM in computers is Dynamic RAM, which is much cheaper but also much slower than Static RAM. Access to Dynamic RAM takes hundreds of CPU cycles. With multiple cores all accessing this Dynamic RAM, it is easy to see that the FSB and RAM access are the biggest bottlenecks in the traditional architecture.


1.2. NUMA (Non-Uniform Memory Access) Architecture

To remove the FSB and RAM as bottlenecks, a new architecture was designed with multiple Dynamic RAM modules and multiple busses to access them. Each core could possibly have its own RAM module. The Southbridge to access I/O can also be duplicated so different cores could use different busses to access the hardware. For real-time applications, this would have the added advantage of no longer sharing these resources with non-real-time applications.

[Figure: NUMA architecture. Each pair of cores has its own RAM modules, and the Southbridge I/O (USB, PCI-e, SATA) is duplicated so that different cores can use different busses to reach the hardware.]

Originally NUMA was developed for the interconnection of multiple processors, but as processors now have more and more cores, it has been extended and used inside processors.

The NUMA design introduces new problems, as variables are only located in a single RAM module while multiple cores may need to access them. Accessing variables attached to a foreign core may be much slower, and applications should be developed specifically for this architecture to use it properly. We would recommend only using this architecture for applications developed for it, but when the number of cores on a system exceeds four, the FSB of the UMA architecture gets very easily overloaded and may cause even more delays. Therefore, NUMA will be the architecture for larger machines.

To gain the advantages of NUMA without its issues, some machines are made with nodes of processors sharing a RAM module. In that case, applications only using cores inside a single node would not see the effects of NUMA and would work normally.

We will not go into more detail on this architecture as it is not supported by RTX at the moment.


1.3. Memory and Caches

As mentioned previously, Dynamic RAM access is slow for the CPU (hundreds of cycles on average to access a word) compared to access to Static RAM. Therefore, CPUs now include Static RAM used as cache, organized in multiple levels. When a core needs to access data, that data will be copied from the main memory (Dynamic RAM) into its closest cache so the core can access it multiple times faster.

[Figure: Quad-core CPU with two levels of cache. Each core has its own L1d and L1i caches; a single L2 (LLC) is shared by all four cores.]

[Figure: Quad-core CPU with three levels of cache, level 2 exclusive. Each core has its own L1d, L1i and L2 caches; the L3 (LLC) is shared by all cores.]

[Figure: Quad-core CPU with three levels of cache, level 2 shared by core nodes. Each core has its own L1d and L1i caches, each pair of cores shares an L2, and the L3 (LLC) is shared by all cores.]


The first level of cache (L1) has separate areas for instructions (the program code) and data (the variables); the other levels are unified. Higher levels of cache are larger and slower than the first level and may be exclusive to a core or shared by multiple cores. The largest cache, also called the Last Level Cache (LLC), is normally shared by all cores. Average access times for each cache level, given in CPU cycles, are shown in the table below.

CACHE LEVEL       AVERAGE ACCESS TIME (CPU CYCLES)
Level 1           ~3
Level 2           ~15
Level 3           ~20
Main memory       ~300

As you can see, performance takes a huge hit any time the CPU has to wait for the main memory to be accessed because the data is not available in the cache. This is called a "cache miss". Main memory access is much faster when it is done in bulk or in order, and the data and instruction access pattern of a program is rarely random, so CPUs try to predict which memory will be used next and load it into the cache in advance. This technique is called prefetching and improves performance significantly (~90% delay reduction).

To predict which data should be in the cache, processors rely on two principles: temporal locality and spatial locality.

• Temporal locality means that variables and instructions are usually accessed multiple times in a row. This is especially true with loops and variables local to a function.

• Spatial locality means that variables defined together are usually used together, and the next line of code most likely contains the next instruction to execute.

Temporal locality is the reason for having caches: copying data to a local buffer before using it is only relevant if it will be accessed multiple times. To take advantage of spatial locality and the fact that RAM is accessed faster in bulk, data is not requested and transferred in individual bytes but in cache lines, which are normally 64 bytes long. The CPU will also usually prefetch the next line automatically. The work of the programmer, detailed in section 2, is to make the data and instruction order as predictable as possible so that the prefetch works efficiently.
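As an illustration of spatial locality and prefetching, the sketch below (C++; the names and array size are illustrative, not taken from the original paper) sums the same two-dimensional array twice: once in the order it is laid out in memory and once column by column. Once the array is larger than the caches, the second traversal typically runs several times slower.

    // Illustrative only: the same 2D array traversed in memory order
    // (good spatial locality, prefetch friendly) and in column order
    // (a jump of N * sizeof(double) bytes between consecutive accesses).
    constexpr int N = 2048;             // 2048 x 2048 doubles = 32 MB, larger than typical caches
    static double grid[N][N];

    double SumRowMajor()
    {
        double s = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) // walks through memory contiguously
                s += grid[i][j];
        return s;
    }

    double SumColumnMajor()
    {
        double s = 0.0;
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i) // touches a different cache line on every access
                s += grid[i][j];
        return s;
    }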


1.4. Bottlenecks

Because caches are expensive they are usually small, so not all the data and instructions related to an application can be held in a cache. A cache is also shared by all the applications running on the cores attached to it. This means that when an application or thread loads too much data, it will evict older data that another thread or application may still want to use and will need to reload. The more code that is executing on a core, the more likely instructions are to be evicted when switching threads. This fight for cache is called "memory contention".

The LLC is shared by both real-time and non-real-time applications when the system is a single socket, or when a multiple-socket system is configured with a socket hosting both real-time and non-real-time cores. As a result, non-real-time applications can affect the performance of the real-time application when they use a large amount of memory, by playing an HD video for example. If the amount of data used by an application or thread is small, it may be possible to choose a CPU with enough cache to keep that whole data set in the exclusive caches, where it will not be affected by other applications.

The FSB, which accesses the main memory, is very slow compared to the CPU, so heavy data loading from a single core can use the whole bandwidth alone. This is called "bus contention" and starts to become visible when the CPU has four or more cores. As this bus is the real bottleneck, it is usually better to spend money on a faster RAM and chipset bus than on a faster CPU. A faster CPU will usually just wait longer.

1.5. Data Synchronization

The main reason why an application does not run faster on two cores than on one is data synchronization. Code serialization can also play a role in execution latency. As mentioned previously, cores always access data through their lowest level of cache, which is exclusive. This means that if a variable is accessed by two cores, it must be present in the cache of both cores, and when its value is modified it has to be updated in both caches. The CPU must ensure data consistency for the whole system, which can cause huge delays, generally as a result of one core needing to snoop data in the cache of another core to ensure data integrity.

To maintain this consistency the CPU uses the MESI protocol, which defines the state of each cache line:

• MODIFIED: The value has been modified by this core, so this is the only valid copy in the system.

• EXCLUSIVE: This core is the only one using this variable. It does not need to signal changes.

• SHARED: This variable is available in multiple caches. Other cores should be informed if it changes.

• INVALID: No variable has been loaded, or its value was changed by another core.


The status of each cache line is maintained by each core. To do this, the cores have to observe all data requests to the main memory and inform the other cores when they already hold a variable being read or have modified its value. Every time a core wants to access a variable that has been modified in the cache of another core, the new value has to be sent to the main memory and to the reading core. Access to this value becomes as slow as if there were no cache. If a variable is often written by one core and read by another, here is what will happen:

ACTION                       CORE 1 STATUS    CORE 2 STATUS
Core 1 reads the value       EXCLUSIVE        INVALID
Core 2 reads the value       SHARED           SHARED
Core 1 modifies the value    MODIFIED         INVALID
Core 2 reads the value       SHARED           SHARED
Core 1 modifies the value    MODIFIED         INVALID
Core 2 reads the value       SHARED           SHARED

In this case Core 2 has to read the value from the main memory every time, which removes the advantage of the cache. And Core 1 has to send a "Request For Ownership" (RFO) on the FSB every time it modifies the value, and then update the main memory every time Core 2 requests the value.

Access to this value will be much slower when the two threads are on different cores than when both threads run on a single core, and it will add traffic to the FSB.

There is much less of a problem for instructions because they are normally read-only; in that case there is no need to know how many cores are using them. Self-modifying code exists, but it is very dangerous and rarely used, so we will not address it here.
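A minimal sketch of this write/read ping-pong pattern is shown below (standard C++ threads; the variable and function names are illustrative). One thread keeps modifying a shared value while another keeps reading it, so the cache line bounces between the MODIFIED, SHARED and INVALID states exactly as in the table above; in a real application the writer and reader would be pinned to different cores.

    #include <atomic>
    #include <thread>

    std::atomic<int> sharedValue{0};        // the line holding this value ping-pongs between cores

    void Writer()                           // intended to run on core 1
    {
        for (int i = 0; i < 1000000; ++i)
            sharedValue.store(i, std::memory_order_relaxed);    // each store needs ownership (RFO)
    }

    void Reader()                           // intended to run on core 2
    {
        int last = 0;
        for (int i = 0; i < 1000000; ++i)
            last = sharedValue.load(std::memory_order_relaxed); // may miss after every store
        (void)last;
    }

    int main()
    {
        std::thread writer(Writer);
        std::thread reader(Reader);
        writer.join();
        reader.join();
        return 0;
    }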


1.6. Multi-Core and Hyperthreading

Multiple cores and multiple hardware threads on a single core (hyperthreading) may seem to serve the same purpose, but the way they share resources is very different. As a result, the way they should be used is almost the opposite.

[Figure: A hyperthreaded core. Thread 0 and Thread 1 run on Core 1 and share the same L1i, L1d and L2 caches.]

With multiple cores, the level 1 cache is duplicated and each core has its own. This means that more cache is available on the system, but the caches need to be synchronized.

With hyperthreading, the two threads share the same level 1 cache, so in the worst case each thread only has half of it available. Also, since both the core and the level 1 cache are shared between the two hyper threads, both of them should be used by real-time applications or both by non-real-time applications.

So, with multiple cores, programmers must limit the amount of data shared between the threads on each core to avoid synchronization delays. With hyperthreading, the level 1 cache is shared between the hyper threads. This means that if the data used by each thread is different, it will cause cache contention and data will have to be loaded from the main memory much more often. In this case performance will improve only if independent operations are performed on the same data set. This is usually a special situation, so in most cases hyperthreading will not improve performance and we suggest disabling it.


1.7. DMA (Direct Memory Access)

Access to I/O devices, which can be PCI-e, USB or any other type, is controlled by CPU instructions. This means that when a device such as a NIC signals updated data with an interrupt, the CPU has to query for the data and send it to the main memory. This adds a lot of unnecessary load on the FSB. For high-speed busses, the Direct Memory Access (DMA) feature was developed. Using DMA, a device will signal the CPU that data was updated and send this data directly to the main memory without needing any action from the CPU.

[Figure: DMA on the traditional architecture. Devices on the Southbridge (USB, PCI-e, SATA) transfer data through the Northbridge directly into RAM, without going through the CPU cores.]

If the CPU plans to use this data immediately, a newer feature called Direct Cache Access (DCA) can be used, where the data is copied to both the main memory and the CPU cache.


2. Memory & Multicore Programming

2.1. Caching Methods

Caching is completely controlled by the processor and cannot be directly modified by the programmer, but the way it is implemented may have a huge impact on program performance. The data available in the different caches may differ, or the data in lower-level caches may be duplicated in higher-level caches. Having this data duplicated makes the higher-level cache seem smaller, but if two cores access the same variable through a shared higher-level cache, they can use it to synchronize values instead of going all the way to the main memory.

We can assume that the cache is always full, as this will be true once the computer has been running for a few minutes. So, any data added to the cache will cause another cache line to be "evicted", which means copied back to the next level of cache, which will in turn send a value back to the main memory if it did not already contain the evicted value. By default, the least recently used value in the cache is evicted, although some newer processors use more complex calculations to choose which value to evict.

Processors have memory management instructions that can be used to force or bypass these cache features, but using these instructions is very hardware- and application-specific and requires development time. These instructions will not be described here, but they are explained in Ulrich Drepper's paper referenced in the introduction.
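As a hedged illustration of the kind of instructions meant here, the sketch below uses x86 prefetch and non-temporal (streaming) store intrinsics from immintrin.h, which are available in GCC, Clang and MSVC. The function name and prefetch distance are illustrative; the destination is assumed to be 16-byte aligned and the length a multiple of four floats.

    #include <immintrin.h>

    // Copies data that will not be reused soon: the prefetch pulls future source
    // lines into the cache early, while the streaming store writes results around
    // the cache so they do not evict lines other threads still need.
    void CopyOnce(const float* src, float* dst, int n)
    {
        for (int i = 0; i < n; i += 4) {
            _mm_prefetch(reinterpret_cast<const char*>(src + i + 64), _MM_HINT_T0);
            _mm_stream_ps(dst + i, _mm_loadu_ps(src + i));   // bypasses the cache hierarchy
        }
        _mm_sfence();   // make the streaming stores visible to other cores
    }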


2.2. Exclusive and Shared Caches

As described in section 1.3, there are multiple cache architectures available, and they impact performance. The selection of which core each thread should run on should depend on how the threads interact and on which caches are available.

Ideally, threads that share many variables should run on the same core, and threads running on different cores should not share variables. But if this were always possible, there would only be small applications running on a single core. Bigger applications require multiple cores to run different modules and still need to share data and synchronize those modules.

Programmers therefore need to identify which variables are shared and which are not, and try to keep as many local variables as possible in the exclusive caches while letting shared caches hold the shared variables. In the specific case of RTX, where there are two operating systems each with their own cores, an ideal situation would be to have three levels of cache: level 1 exclusive to each core, level 2 separated into an RTX-core cache and a Windows-core cache, and the last level cache shared by all cores. This way the Windows applications cannot pollute the RTX level 2 cache, and threads on the different RTX cores can share variables without relying on hardware which is shared with the non-real-time space. This ideal situation, however, is only valid as long as the most commonly used variables in the RTX applications do not exceed the size of the level 2 cache.

[Figure: Ideal cache layout for RTX. Cores 1 and 2 are assigned to Windows and cores 3 and 4 to RTX; each core has its own L1d and L1i, each pair of cores shares an L2, and the L3 (LLC) is shared by all cores.]
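How a thread is bound to a chosen core is platform specific. On the Windows side it can be done with the documented SetThreadAffinityMask call; the sketch below shows the general idea (the core index is illustrative, and on the RTX side the corresponding RTX64 affinity settings would be used instead).

    #include <windows.h>

    // Pin the calling thread to one logical processor so that threads which
    // share many variables can be kept on the same core (and on the same L2
    // in the layout shown above).
    void PinCurrentThreadToCore(DWORD coreIndex)
    {
        DWORD_PTR mask = static_cast<DWORD_PTR>(1) << coreIndex;   // bit N selects logical processor N
        SetThreadAffinityMask(GetCurrentThread(), mask);
    }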


2.3. Optimizing Variable Declaration

To take advantage of the multi-core and cache optimizations, the different variable types have to be identified and handled differently. The different variable types are:

• Variables used by a single core. These variables should only appear in one level 1 cache. While they can be evicted to L2/LLC, they cannot appear in multiple L1 caches.

• Read-only variables (initialized at the beginning and then never changed). These variables can be shared by multiple cores without performance issues.

• Mostly read-only variables.

• Often modified variables. Variables that are often modified by multiple cores, or by a core while in a shared state, will have slow access, so they should be grouped together to avoid interfering with other variables.

There are compiler definitions that can ensure alignments and explicitly indicate the type of variables, but improvements can be achieved just by declaring the variables properly (a short declaration sketch following these rules is shown at the end of this section):

• Variables are accessed by cache lines of 128 bytes, so it is best to make sure only one type of variable is present in a cache line. To achieve this, variables can be grouped in structures whose length is a multiple of 128 bytes.

• The total size of the structures should be as small as possible, and variables will be aligned in structures (we can assume a 64-bit alignment on 64-bit systems), so smaller variable types should be grouped together. For example, two int or four short.

• Prefetching will load the next cache line, so the first variable to be used should be at the beginning of the structure and variables should be declared in the order they are used.

• "Mostly read-only" and "Often modified" variables that are consumed together should be grouped together so that they can all be updated in a single operation.

If these rules are not followed you may encounter an unexpected situation called "false sharing". This happens when a variable that should be read-only, or consumed by a single core, is marked invalid because it is on the same cache line as a variable modified by a different core. In that case access to the first variable will be slow even though it should not be. Keeping the different types of variables in different cache lines ensures this situation does not happen.

Access to shared and often modified variables may be slow and even subject to interference if the FSB is overloaded. These variables should not be accessed by threads which are highly deterministic and require small jitters. To achieve this, it may be useful to create core-specific relay variables that are accessed by the highly deterministic thread, and let a lower-priority thread synchronize their values with the shared variables.
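A minimal declaration sketch following the rules above is shown below (C++; the structure names, field types and the 128-byte grouping constant are illustrative, not taken from the original paper). alignas guarantees that each structure starts on its own cache-line boundary, so often-modified data never shares a line with read-only data.

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kGroupSize = 128;   // grouping unit recommended above

    // Read-only / mostly read-only configuration, safe to share between cores.
    struct alignas(kGroupSize) SharedConfig {
        double gain;
        double offset;
        std::int32_t mode;
    };

    // Often-modified counters owned by a single core; one instance per core
    // keeps each core's writes on its own cache lines.
    struct alignas(kGroupSize) PerCoreCounters {
        std::uint64_t samplesProcessed;
        std::uint64_t errors;
    };

    SharedConfig    g_config;
    PerCoreCounters g_counters[4];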


2.4. Optimizing Variable Access

When work is done on big data sets which exceed the cache size, it may be useful to write the code in a way that optimizes cache use. These operations could be picture analysis or other operations on big matrices. This section should only be considered if the data set is too big to fit in the cache. Since the data will have to be loaded from the main memory, it should be consumed in a way that takes advantage of cache and RAM technologies:

• If possible, the data set should be broken into smaller sets that fit in the cache, and all operations on a single set should be done before moving to the next set.

• The data should be accessed in the order it is defined in memory so that prefetching reduces the loading time.

We will use the multiplication of two square matrices A and B, each with 2000 rows and columns, as an example to show possible modifications to the code. A matrix is defined in memory as an array of arrays, so its values are stored row by row: the elements of one row are contiguous, and the next row follows immediately after.

The standard code to accomplish the multiplication would be very simple and use 3 for loops:

    for (int i = 0; i < 2000; i++) {
        for (int j = 0; j < 2000; j++) {
            for (int k = 0; k < 2000; k++)
                Result[i][j] += A[i][k] * B[k][j];
        }
    }

With this code the matrices are consumed in different ways:

[Figure: Access pattern of matrix A. Its elements are read row by row, in the order they are stored in memory.]

[Figure: Access pattern of matrix B. Its elements are read column by column, jumping across memory on every access.]


With this simple logic, the data in the first matrix is processed only once and in the order it is located in memory. But the data in the second matrix is loaded many times and in an order which looks random to the processor.

In such cases the programmer should try to break the calculations into smaller data sets that fit in the L1d cache, and try to finish using a data set before moving to the next one.

    int SetSize = DataSetSize / CellDataSize;   // 64-byte cache line / 8-byte value = 8

    for (int i = 0; i < 2000; i += SetSize) {
        for (int j = 0; j < 2000; j += SetSize) {
            for (int k = 0; k < 2000; k += SetSize) {
                // Work on one SetSize x SetSize block of each matrix at a time
                // so the blocks stay in the L1d cache while they are in use.
                for (int i2 = 0; i2 < SetSize; i2++) {
                    for (int j2 = 0; j2 < SetSize; j2++) {
                        for (int k2 = 0; k2 < SetSize; k2++)
                            Result[i + i2][j + j2] += A[i + i2][k + k2] * B[k + k2][j + j2];
                    }
                }
            }
        }
    }


To simplify the expressions, pointers can be used to address into the matrices:

    double *R2, *A2, *B2;

    for (int i = 0; i < 2000; i += SetSize) {
        for (int j = 0; j < 2000; j += SetSize) {
            for (int k = 0; k < 2000; k += SetSize) {
                for (int i2 = 0; i2 < SetSize; i2++) {
                    R2 = &Result[i + i2][j];
                    A2 = &A[i + i2][k];
                    for (int j2 = 0; j2 < SetSize; j2++) {
                        for (int k2 = 0; k2 < SetSize; k2++) {
                            B2 = &B[k + k2][j];
                            R2[j2] += A2[k2] * B2[j2];
                        }
                    }
                }
            }
        }
    }

To reduce the frequency of the B2 assignment operation, the two inner loops can be inverted:

    for (int i = 0; i < 2000; i += SetSize) {
        for (int j = 0; j < 2000; j += SetSize) {
            for (int k = 0; k < 2000; k += SetSize) {
                for (int i2 = 0; i2 < SetSize; i2++) {
                    R2 = &Result[i + i2][j];
                    A2 = &A[i + i2][k];
                    for (int k2 = 0; k2 < SetSize; k2++) {
                        B2 = &B[k + k2][j];   // now assigned once per k2 instead of once per (j2, k2) pair
                        for (int j2 = 0; j2 < SetSize; j2++)
                            R2[j2] += A2[k2] * B2[j2];
                    }
                }
            }
        }
    }

Such code modifications can reduce the calculation time by 75%, which can make a big difference. But as they make the code more complex and introduce new variables, they are only useful when the data processed is large enough that the improvements outweigh the extra development time.


2.5. Optimizing Code Predictability

There is much less work to be done with the source code itself: the compiler knows the proper ordering and optimization rules and will apply them automatically. But any prediction error on instructions will cause many more delays than a prediction error on data, because instructions need to be decoded before they are used by the CPU. Therefore, if possible, the code should limit prediction errors.

Branching should be avoided in the default/normal execution paths. When the expected conditions are met, the code should not branch or jump. When a condition has a most likely value, the code for the most likely case should follow the condition, as it will be the code already pre-loaded. This happens, for example, when checking for errors or incorrect parameters: we can expect that things work properly and that the data provided is valid most of the time. In that case, normal operation should follow the condition, and the error handling code can be located further away.

There is another optimization that happens in the CPU, called "Out Of Order execution" (OOO). This is when the CPU detects that two instructions are unrelated and the second one can be started first to save processing time. This normally does not impact performance, unless the second instruction uses a resource (such as reading from the main memory) that the first instruction will also need, thus notably delaying the execution of the first instruction. Memory management instructions of the CPU can prevent OOO from happening if it causes a problem, but once again this requires a precise understanding of the hardware and of how memory management instructions work.
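A small sketch of this error-checking layout is shown below (C++; __builtin_expect is a GCC/Clang extension and the function is illustrative). The valid case falls straight through the condition, while the unlikely error path branches away.

    #include <cstdio>

    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    int ProcessSample(int value)
    {
        if (UNLIKELY(value < 0)) {              // rare error path branches away
            std::fprintf(stderr, "invalid sample\n");
            return -1;
        }
        return value * 2;                       // expected case follows the condition directly
    }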

2.6. Serialized Code

There are many pieces of an application, including the libraries it uses, that are called by multiple threads running on different cores. There is no way to avoid serialized code completely without duplicating most of the application and OS code, which would hurt performance even more.

To avoid concurrent access to serialized parts of the code, the OS uses an internal mutex called a spinlock. A spinlock allows a thread to wait for the needed resource without releasing the core to another thread. Spinlocks are generally only used for very short waiting periods, much shorter than the scheduler's timer period.

Recent high-end processors have added a new feature to reduce the impact of serialized code: transactional registers. Operations done in transactional registers are not applied to the normal registers immediately. This means that instead of waiting for the resource using a spinlock, the second thread performs its calculations in the transactional registers and applies them when the resource is released, but only if the registers it read were not modified in the meantime. In over 90% of cases the read registers will not have been modified, and the second thread will have run without waiting.
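For reference, a minimal spinlock sketch in standard C++ is shown below (names are illustrative; a real OS spinlock also disables preemption or interrupts, which is omitted here). The waiting thread keeps its core and spins until the flag is released, which is only appropriate for the very short critical sections described above.

    #include <atomic>

    class SpinLock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock()
        {
            while (flag.test_and_set(std::memory_order_acquire))
                ;   // busy-wait without releasing the core
        }
        void unlock()
        {
            flag.clear(std::memory_order_release);
        }
    };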


3. Practical Issues

3.1. Performance Decrease in Multicore

ISSUE

A real-time application was too slow or caused the CPU load to be close to 100%, so an extra RTX core was added to increase performance. But performance did not improve with the extra core, and in some cases even decreased.

CAUSE

This is most likely due to data synchronization delays. If the application was not developed for a multi-core environment and many variables are shared by the threads on different cores, they will render caches useless and performance will likely decrease due to constant main memory access or cache coherency mechanisms.

POSSIBLE SOLUTIONS

The best solution is most likely to modify the application following the guidelines in section 2 of this paper.

If the software cannot be modified, adding cores is not a workable approach for this specific application. Increasing the frequency of the single core used, and possibly the frequency of the RAM, would be more effective.


3.2. Separating Thread Variables Did Not Improve Performance

ISSUE

The real-time application was modified to run on multiple cores. The threads on each core only share a limited number of variables and yet these modifications do not seem to have improved performance significantly.

CAUSE

If separating the variables used by different cores did not improve performance, there is probably false sharing happening. As explained in section 2.3, you need to make sure that variables that are not shared are not on the same cache line as variables which are shared. Variables are not loaded individually by the cache; they are loaded as cache lines of 128 bytes.

POSSIBLE SOLUTIONS

Review the declaration of the variables as explained in section 2.3. Group the different sorts of variables into separate structures that each occupy whole cache lines, to make sure they are separated in memory and will not interfere.


3.3. Processes on Different RTX64 Cores Interfere

ISSUE

Two independent applications have been set up to run on two different RTX cores, but when one of them has heavy calculations it causes latency in the other.

CAUSE

Two different cores still share resources even if everything has been done to separate them. Even if the I/O devices used are on separate PCI-e lines and there is no shared memory, the last level cache (LLC) of the CPU is still used by both applications, and so is the front side bus (FSB) used to access the I/O and RAM. There can be contention in the LLC, in the FSB, or in both.

When an application does heavy calculations, it likely also uses a large amount of data. This data will pollute the LLC and force the other application to request the data from the RAM again. Loading this data will also create a lot of traffic on the FSB which might get congested. If the other application also needs to access RAM data, these accesses will be longer.

POSSIBLE SOLUTIONS

A first solution would be to limit how fast calculations are performed by the first application, to prevent it from polluting the cache and overloading the FSB. Another solution is to get a CPU with more cache and a RAM module with a higher frequency to reduce the impact of the heavy calculations.

On a few very high-end processors, Intel has developed a technology called "Cache Allocation Technology" (CAT), which reserves cache space for specific cores. Intel has also announced a technology called "Memory Bandwidth Allocation" (MBA), which reserves FSB bandwidth for a specific core. These technologies are currently only available on the latest series of high-end processors and are supported by RTX64 3.4.


3.4. More Windows Cores Cause Higher Latency

ISSUE

An RTX application was deployed on a new CPU which is faster and has more cores. The new cores have been assigned to Windows. There has been no change to either the Windows or RTX side of the system and yet there are now delays in RTX applications.

CAUSE

This is caused by bus contention. The front side bus (FSB) which connects the CPU cores to the RAM is a limited resource much slower than CPU cores. It is shared by all cores and any single core can load it completely. There is most likely a Windows application in the system that consumes a lot of data. This application accesses the data through the FSB as fast as possible using all cores available to Windows. With more Windows cores the ratio of FSB bandwidth given to the RTX cores has been reduced, increasing the delays to access the main memory.

POSSIBLE SOLUTIONS

The ideal solution would be to move to a NUMA platform or reserve bandwidth for RTX. But changing the platform to NUMA would require major modifications to the applications and NUMA is not supported by RTX yet. Intel has announced a technology to allocate bandwidth called “Memory Bandwidth Allocation” (MBA), which is supported by RTX64 3.4.

The currently available solutions are to limit the number of system cores to avoid the contention (contention becomes very important when there are more than four cores) or to increase the FSB bandwidth. If possible, our recommendation is to purchase faster RAM and ensure that the Chipset supports this faster frequency.

RTX64 3.4 Support for CAT and MBA

Intel has added features to its newer high-end processors, such as cache allocation technology (CAT), memory bandwidth allocation (MBA) and transactional registers, to reduce the performance impact and protect critical threads against interference. RTX64 3.4 now supports these new features by linking them to the priority and affinity settings of real-time threads, enabling developers to use newer processors without needing to make changes to their programs.


Conclusion

A real-time operating system allows developers to write real-time applications the same way they write Windows applications, and it handles scheduling and the separation of resources with Windows. But some of the resources that are still shared, like the processor cache and front-side bus bandwidth, become bottlenecks of the system on recent processors. This creates performance issues that often seem inexplicable or counterintuitive and impacts the scalability of the system as a whole.

In this paper, we examined what causes these performance issues, how hardware choices can improve performance, and workarounds to help resolve the issues. We also introduced new technology, Intel's CAT and MBA, that further improves performance and is now supported by RTX64 3.4. With this information and these techniques, you can optimize applications for multicore systems and improve scalability for improved results across the organization.

For more information about how to solve performance challenges or the latest features supported by RTX64, please contact your IntervalZero rep or get in touch.


Selecting a Hardware Platform

Follow these guidelines to determine which components will impact your system the most.

a. Processor

What a real-time system requires the most is stability, not burst performance or power saving. Atom processors provide good performance, while mobile processors will never be very stable. Features like hyperthreading, boost and sleep states should be disabled if available.

Independent RTX applications should use separate cores, and these should be separate from the Windows cores. But to avoid contention issues, we recommend limiting the number of cores to four, or making sure that no Windows application will have bursts of data consumption.

The size and distribution of the cache can often have more impact than the CPU frequency. Generally, the bigger the cache, the faster the code will be executed, although as cache size increases, access latency tends to increase as well. If the cache is still smaller than the data set and many RAM accesses are needed, a larger cache will reduce overall delays. If the cache is already larger than the data set, then increasing it is counterproductive.

If you have an RTX application that uses multiple cores, having a level 2 cache shared only by the RTX cores may be helpful (see section 1.3).

b. Chipset & RAM

With a heavy application, the RAM access will be the bottleneck. The size of the RAM is not the issue: you only need to make sure all the data used by the applications fits in it; any extra space will not change anything. The frequency of the RAM and chipset bus, however, is critical and will determine how fast the CPU can access all the data used. For this reason, you should consider as fast a frequency as you can afford.

c. I/O Devices

Any possible bottleneck and conversion delay should be avoided, as there will be enough limitations from the RAM access side.

Any device linked to RTX should have its own PCI-e line to avoid delays. New processors only support PCI-e, so PCI devices are actually grouped and linked to PCI-e lines by hardware chips. This should be avoided if possible, as this extra chip will introduce uncontrolled delays. Only devices directly connected to the chipset will deliver good performance.

All sleep and power saving options should be disabled for the relevant PCI-e lines. These options are in the BIOS and are usually not modifiable by the user, so a correctly configured BIOS should be requested from the computer vendor.

