Jakob Engblom, PhD, Intel, Stockholm, Sweden

My Background
Jakob Engblom
• MSc, Computer Science, Uppsala
• PhD, Real-Time Systems, Uppsala
Currently:
• Product Management Engineer, Simics Core team, at Intel in Stockholm, Sweden
• Software Evangelist – Simulation
Previously:
• IAR Systems, Virtutech, Wind River
• Product management, product marketing, technical sales, technical marketing, business development, training development, demos, ...
Blog at the Intel Developer Zone
• https://software.intel.com/en-us/meet-the-developers/evangelists/team/jakob-engblom
My own blog, since 2007:
• https://jakob.engbloms.se
• https://www.engbloms.se/jakob.html
Very rarely touch actual hardware when doing development. Very rarely.
Copyright Intel 2019 | SAMOS 2019

What Does Intel Do?
• Intel® Core® • Intel® Xeon® • SSD • Ethernet • Intel® Atom™ • Processors • 3D XPoint™ • WiFi • Chipsets • Chipsets • Intel® Optane™ • Bluetooth • Smart NICs • GNSS
Laptop and desktop | Server | Storage | Connectivity
FPGA
• SoC-FPGA
• FPGA
• FPGA-CPU combo
AI and ML
• Movidius
• Nervana
• MobilEye
• Intel® Xeon®
IoT
• Processors
• Gateways
• Security
• Management
Software
• Development tools
• Compilers
• Simulation solutions
• Linux & Windows drivers
• UEFI & BIOS

Hardware: A Hard Development Platform?

Hardware is Hard When it is...
• Not yet available
• Flaky prototype stage
• Not available anymore

Hardware is Hard When it is...
• Inconveniently large & complex
• Dangerous to play with
• Inaccessible & expensive

Solution: [Fast] Virtual platform
Full-system virtual platform
• Simulated target hardware
• Real software, same as on the hardware
• Fast enough to run complete workloads*
“Free developers from hardware”
Stack (diagram, top to bottom):
• Apps: user-level application code
• SDKs, libraries, middleware, …
• OS: operating system (OS)
• HW: virtual/simulated target hardware (HW), with network
• Virtual platform, like Wind River Simics®
• Host operating system
• Host hardware
* Speed depends on model abstraction level...

Hardware Not Yet Available: Shift-Left
Traditional workflow (over time):
• Hardware design and production
• Hardware-dependent software development
• Hardware/Software Integration and Test
Shifting left using virtual platforms (software development and testing shifting left):
• Hardware design and production
• Virtual platform
• Hardware-dependent software development
• Hardware/Software Integration and Test

Note: Shift Left Applies at Multiple Levels
SILICON VENDOR CHIPS
• System architecture
• Architecture
• Hardware validation
• SoC integration
• Firmware
• Boot code
• Drivers
• Operating System (OS) support
• Compilers
• Software Development Kits (SDKs)
• Frameworks
• Application optimization & porting
• ...
(CUSTOM) BOARD
• Board integration
• Boot code
• Drivers
• Self-test & fault tolerance
• Additional OS support
• Real-time operating system (RTOS)
• Board-level SDK
• Manageability features
• Applications
• …
OEM PRODUCT
• Full-system software
• Integration with physical designs
• “Digital Twin”
• Management & resiliency
• Manufacturing tests
• Deployment and tracking
• Control software
• Applications
• …
Typically, this is the customer of the silicon vendor!

(Computer) Architecture with software in the loop
Examples:
• Processor, pipeline, cache design
• New instructions & execution modes
• Hardware accelerator design
• Hardware-software interface design
• Hardware-software codesign & optimization
Loop (diagram):
• Design / architecture
• Build model
• Virtual platform & architecture model, running the software workload
• Performance, time, power, statistics, ...
• Update design & model
• Update software
This is inside a silicon vendor, before chips are manufactured

Hardware Validation and Preparation of Tests
Validate the actual implementation before “tape-in” to the chip fab
Use virtual platform to test RTL in a system context
• Run real software loads
• Run validation software in pre-silicon
• Develop and test post-silicon tests
RTL = Register Transfer Level
• VHDL, Verilog, etc.
Diagram: test software, virtual platform, RTL implementation
This is inside the silicon vendor

Testing large-scale networks
Setup (diagram), running in Wind River Simics® on a host OS and host hardware:
• Wireless network node: sensing, actuating, communications; small OS; IO and radio; simulated HW
• Gateway: communication, mesh management, edge analytics; RTOS or Linux; radio and LAN; simulated HW
• Cloud server: simulate the server in the same way as the other nodes, or connect to a real-world server
• Wireless network
• Simulation of the world
• Simulation of wireless network conditions
Example of work done by an OEM company building actual nodes and systems
https://software.intel.com/en-us/blogs/2018/04/11/1000-machines-in-a-simulation

Workload Bring-Up and software validation
• Update software stack to use latest hardware instruction sets and features
• Ensure that the integration of hardware, boot code, drivers, OS, and applications works before the silicon arrives
Setup (diagram), running on Wind River Simics®:
• Future Server Platform 1 – Database server: database program, user land, Linux* distribution, UEFI (Unified Extensible Firmware Interface)
• Future Server Platform 2 – App server: application server (payload), Java* Virtual Machine (JVM), user land, Linux* distribution, UEFI
• SPECjEnterprise driver utility
• Each platform: processor sockets 1 and 2, each with cores and 96GB RAM; PCH; disk; 10G Eth
• Disk image contents: OS + user software
• Network connecting the 10G Eth interfaces of the two platforms
This particular example: silicon vendor + software vendor cooperating on next-gen hardware tuning
https://software.intel.com/en-us/blogs/2018/03/15/software-on-wind-river-simics-virtual-platforms-then-and-now

System-level Debug example: Simics-on-Simics
Bug only hits when the file is on an NFS server, and we have at least 2 cores in the Simics host: concurrency necessary
Setup (diagram), all inside (Outer) Wind River Simics®:
• Simics host: server hardware, host OS (SUSE Linux 11), VMXMON driver
• (Inner) Simics: device model working with the file, coordinated with the external simulator
• External test program (stand-in for internal Intel simulator)
• Both mmap() the file
• Network
• NFS server: server hardware, host OS, file on disk
Debug workflow: Reproduce, Repeat, Reverse, Analyze
This kind of work happens at all users of virtual platforms
https://software.intel.com/en-us/blogs/2016/05/30/finding-kernel-1-2-3-bug-running-wind-river-simics-simics

A melting pot of traditions
• 1950s “Software” (Fast functional, Virtual machines, …)
• 1960s “Computer architecture” (Cycle accurate, cache models, …)
• 1990s “Hardware designers” (RTL, SystemC*, FPGAs, Emulators, …)
• 1940s “Mechanical/physics modeling” (Matlab*, FORTRAN*, ..)
Current virtual platforms / digital twins

What is in a virtual platform?
• User interface
• Simulator infrastructure and features
• Virtual platform API
• Processor core models
• Device models
• Buses and interconnects
• Target system
Virtual platform, such as Wind River Simics®

How to build a fast virtual platform
Fast Instruction-Set Simulator (ISS)
• Functional abstraction level
• Just-in-time compilation (JIT)
• Virtualization
• Simplified timing
• Temporal decoupling
• …
Fast Device Models
• Transaction-Level Modeling (TLM)
• Event-driven simulation
• Simplified timing
• …
Efficient Framework
• Reduce overheads
• Multithreading
• Optimize, optimize, optimize
• …
Tailored Configurations
• Configurations optimized for each use case
• Highest-possible level of abstraction
• …
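
Temporal decoupling, listed under the fast ISS techniques, can be illustrated with a minimal sketch. It assumes a simple round-robin scheduler with a fixed time quantum; the names Core, run_decoupled, and quantum are hypothetical and are not taken from Simics or any other simulator.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Hypothetical stand-in for a processor model; it only counts cycles."""
    name: str
    local_time: int = 0  # virtual cycles executed so far by this core

    def run(self, cycles: int) -> None:
        # A real ISS would execute target instructions here.
        self.local_time += cycles


def run_decoupled(cores, total_cycles: int, quantum: int) -> int:
    """Round-robin the cores, letting each run `quantum` cycles per turn.

    Returns the number of switches between cores. A larger quantum means
    fewer switches (less simulator overhead), but the cores can drift up
    to `quantum` cycles apart in virtual time while a turn is in progress.
    """
    switches = 0
    now = 0
    while now < total_cycles:
        step = min(quantum, total_cycles - now)
        for core in cores:
            switches += 1
            core.run(step)
        now += step
    return switches


if __name__ == "__main__":
    for quantum in (1, 1_000):
        cores = [Core("cpu0"), Core("cpu1")]
        count = run_decoupled(cores, total_cycles=10_000, quantum=quantum)
        print(f"quantum={quantum}: {count} switches")
```

The trade-off shown here is the general one: the quantum bounds how stale one core's view of another core can be, so it is usually tuned per use case.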

Instruction-Set Simulation Techniques
Three ways to run target code in the virtual platform on the host (diagram):
• Interpreter
• JIT compiler, with fall back to the interpreter
• Virtualization, with fall back to the JIT compiler
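
The fall-back idea can be sketched with a toy instruction set. This is an illustration under stated assumptions, not how any real simulator is built: the four-opcode ISA is invented, and translate_block is only a stand-in for a JIT (it pre-decodes runs of addi instructions into a Python closure instead of generating host code). Anything it cannot translate falls back to the interpreter.

```python
def interpret_one(state, program):
    """Interpreter: decode and execute exactly one target instruction."""
    op, *args = program[state["pc"]]
    regs = state["regs"]
    if op == "addi":                      # addi reg, imm
        regs[args[0]] += args[1]
        state["pc"] += 1
    elif op == "jnz":                     # jnz reg, target
        state["pc"] = args[1] if regs[args[0]] != 0 else state["pc"] + 1
    elif op == "out":                     # out reg (device access)
        state["output"].append(regs[args[0]])
        state["pc"] += 1
    elif op == "halt":
        state["halted"] = True
    else:
        raise ValueError(f"unknown opcode {op!r}")


def translate_block(program, start):
    """JIT stand-in: fold a run of addi instructions into one closure.

    Returns None if the instruction at `start` is not translatable.
    """
    deltas = {}
    pc = start
    while pc < len(program) and program[pc][0] == "addi":
        _, reg, imm = program[pc]
        deltas[reg] = deltas.get(reg, 0) + imm
        pc += 1
    if pc == start:
        return None
    end = pc

    def block(state):
        for reg, delta in deltas.items():
            state["regs"][reg] += delta
        state["pc"] = end

    return block


def run(program, use_cache=True):
    state = {"pc": 0, "regs": [0, 0, 0, 0], "output": [], "halted": False}
    cache = {}                            # start pc -> translated block or None
    stats = {"translated": 0, "interpreted": 0}
    while not state["halted"]:
        pc = state["pc"]
        block = None
        if use_cache:
            if pc not in cache:
                cache[pc] = translate_block(program, pc)
            block = cache[pc]
        if block is not None:
            block(state)                  # fast path
            stats["translated"] += 1
        else:
            interpret_one(state, program) # fall back
            stats["interpreted"] += 1
    return state, stats


PROGRAM = [
    ("addi", 0, 3),    # 0: r0 = 3 (loop counter)
    ("addi", 1, 10),   # 1: r1 += 10
    ("addi", 0, -1),   # 2: r0 -= 1
    ("out", 1),        # 3: emit r1
    ("jnz", 0, 1),     # 4: loop while r0 != 0
    ("halt",),         # 5
]
```

Both paths must produce the same architectural state; only the number of dispatch steps differs.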

Processor and system model detail vs speed
(for a typical processor-based system)
Model detail and resulting slowdown:
• Functionality: 1-5x
• Memory latency: 2-10x
• Caches, branch prediction: 10-100x
• Pipeline, OOO, superscalar, … (full microarchitecture model): 10,000-100,000x
• RTL on simulator: 1,000,000x – 10,000,000x or worse
Slowdown effects:
• 10x slowdown = interactively useful
• 100x slowdown = over-night runs
• 100,000x slowdown = 70 days to run 1 minute
Note on RTL: it is very slow and very hard to use for architecture exploration
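
The slowdown figures translate into wall-clock time by simple multiplication. A small worked example follows; the function names are illustrative. One target minute at 100,000x is 6,000,000 host seconds, or about 69.4 days, which the slide rounds to 70.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def host_seconds(target_seconds, slowdown):
    """Wall-clock seconds needed to simulate `target_seconds` of target time."""
    return target_seconds * slowdown


def target_seconds_covered(host_budget_seconds, slowdown):
    """Target time that fits into a given wall-clock budget."""
    return host_budget_seconds / slowdown


if __name__ == "__main__":
    for slowdown in (10, 100, 100_000):
        days = host_seconds(60, slowdown) / SECONDS_PER_DAY
        print(f"{slowdown:>7}x: 1 target minute takes {days:.4f} host days")
    # How much target time fits into an 8-hour over-night run at 100x?
    print(target_seconds_covered(8 * 3600, 100), "target seconds")
```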

Building a Transaction-Level Model (functional)
Inputs and what they produce (diagram):
• Register specification: generates the registers (configuration, control, commands, buffers, …)
• Device interface design: coding/generating the interfaces (bus, interrupts, reset, power, …)
• Functional specification: coding the functional behavior
Result: fast functional/Transaction-Level model (TLM)
Register specifications come from hardware designers. If at all possible, get the register specs in a machine-readable format that can be used to generate the register code.
Functional specifications are basically the same information that goes into programming manuals.
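
A minimal sketch of the split between generated registers and hand-coded behavior. Everything here is hypothetical: the three-register specification, the 32-bit width, and the class names are invented for illustration, and the register bank is table-driven at run time rather than emitted as source code the way a real generator (for example one targeting DML) would do.

```python
# Hypothetical machine-readable register specification (not a real device).
REGISTER_SPEC = [
    {"name": "CTRL",   "offset": 0x00, "reset": 0x0, "writable": True},
    {"name": "STATUS", "offset": 0x04, "reset": 0x1, "writable": False},
    {"name": "DATA",   "offset": 0x08, "reset": 0x0, "writable": True},
]


class RegisterBank:
    """Register storage and access rules built directly from the spec."""

    def __init__(self, spec):
        self._spec = {reg["offset"]: reg for reg in spec}
        self._values = {}
        self.reset()

    def reset(self):
        self._values = {off: reg["reset"] for off, reg in self._spec.items()}

    def read(self, offset):
        return self._values[offset]  # KeyError for an unmapped offset

    def write(self, offset, value):
        reg = self._spec[offset]
        if not reg["writable"]:
            return  # writes to read-only registers are ignored in this sketch
        value &= 0xFFFFFFFF  # assumption: 32-bit registers
        self._values[offset] = value
        # Call the hand-written side effect for this register, if any.
        hook = getattr(self, "on_write_" + reg["name"].lower(), None)
        if hook is not None:
            hook(value)


class ToyDevice(RegisterBank):
    """Hand-coded functional behavior layered on the spec-driven registers."""

    def on_write_ctrl(self, value):
        # STATUS bit 0 means "idle": cleared while CTRL bit 0 (enable) is set.
        self._values[0x04] = 0 if value & 1 else 1


if __name__ == "__main__":
    dev = ToyDevice(REGISTER_SPEC)
    dev.write(0x00, 1)                 # enable the device
    print(hex(dev.read(0x04)))         # STATUS after enabling
```

The point of the split is that a change in the register specification only touches the table, while the functional behavior stays in the small hand-written subclass.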

Black Boxes, White Boxes and Abstraction Levels
In a virtual platform, you typically find three types of device/subsystem models:
• TLM black-box model [including “stubs” and dummies], at the “functional” abstraction level
• TLM white box: firmware running on an ISS, together with TLM models, at the “functional” abstraction level
• Detailed architecture/design model: firmware running on a detailed ISS, together with detailed models
The design model is used to design and validate the actual hardware design; it is a bus-and-cycle-level model, with a 10,000x-100,000x slowdown.

Multithreading a simulator
Note that threading should be applied with care: debug is hard, tuning is complex, correctness tricky
Four ways to use threads (one diagram panel each):
• Thread between processor cores (cores sharing memory and devices in one target)
• Thread across long-latency networks (several targets connected by a network)
• Expensive subsystem on a thread (a complex subsystem next to the cores, memory, and devices)
• Thread to interact with asynchronous world (an interface between the target and external software)

Architecture: Gear-shift to combine fast and detailed
Use fast virtual platform to quickly get to interesting points in the workload
• Detailed simulation would be too slow for the complete run
• Switch over to detailed simulation at interesting points in the workload
• Sampling: only run parts of workload in the detailed simulator
Drop into detailed mode at interesting points, using checkpoints or direct conversion
Timeline (diagram): fast functional simulation, interrupted by repeated segments of warming followed by full details
Detailed runs can execute in parallel for throughput
How to do this best is a research topic
Common in research tools: GEM5, old Simics-GEMS, …
https://jakob.engbloms.se/archives/2514
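
A minimal sketch of the sampling schedule described above, under simplifying assumptions: the workload is cut into equal intervals, each period ends with a fixed number of warming and detailed intervals, and the list interval_cpi is a hypothetical stand-in for the cycles-per-instruction a detailed model would report. Real tools choose sample points and warming lengths far more carefully.

```python
def gear_shift(interval_cpi, period, warm, detail):
    """Assign a simulation mode to each interval and estimate overall CPI.

    Each period of `period` intervals ends with `warm` warming intervals
    followed by `detail` detailed intervals; the rest run in fast mode.
    Only detailed intervals contribute measurements.
    """
    if warm + detail > period:
        raise ValueError("warm + detail must fit inside one period")
    modes = []
    samples = []
    for index, cpi in enumerate(interval_cpi):
        phase = index % period
        if phase < period - warm - detail:
            modes.append("fast")      # functional only, no timing
        elif phase < period - detail:
            modes.append("warm")      # fill caches and predictors, no stats
        else:
            modes.append("detail")    # full timing model, collect stats
            samples.append(cpi)
    estimate = sum(samples) / len(samples) if samples else None
    return modes, estimate


if __name__ == "__main__":
    # Hypothetical per-interval CPI values that a detailed model would see.
    cpi = [1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 4.0]
    modes, estimate = gear_shift(cpi, period=5, warm=1, detail=1)
    print(modes)
    print("sampled CPI estimate:", estimate)
    print("true mean CPI:", sum(cpi) / len(cpi))
```

In this made-up data the sampled intervals happen to be the slow ones, so the estimate overshoots the true mean; that kind of sampling bias is part of why the slide calls this a research topic.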

Architecture: Mixing Abstraction Levels in one setup
• Evaluate the efficiency of the software/hardware interfaces of the accelerator
• Evaluate the performance of the accelerators under real workloads
Setup (diagram), in Wind River Simics®:
• Traffic generation: benchmark, traffic generator, real-world application, …
• Network traffic generation inside or outside of Simics
• Target machine software: target operating system (OS) with device driver, firmware
• Simics target system model: Core, Core, RAM, Disk, APIC, FLASH, Ethernet, USB, Serial, GPU, Network
• Detailed model of the accelerator block: design/architecture model of the accelerator block

Hybrid platforms – validation
Mixing virtual platform models and RTL implementation of parts of a system
• Test actual RTL with software, test software with actual RTL, run hardware validation
• VP provides part of the system, to reduce requirements on hardware resources and increase speed
• Most valuable part to have in VP: processor cores. ISS much faster than putting cores in emulation/prototype
• Transactors convert between fast TLM communication and bit-and-clock-level RTL communication
Setup (diagram):
• Software: user-level test code, operating system, bare-metal test code, UEFI
• Virtual platform model: Core, Core, APIC, Timer, Eth, Disk, Serial, RAM
• Transactors: connection to move transactions to/from the detailed system
• RTL in FPGA / Emulator / Simulator: single IP, cluster, or chip
The IP or set of IPs running in RTL varies across the life cycle and the target of the validation effort.
Sometimes, the picture is inverted: the virtual platform provides the PCH, and core and uncore are on the RTL side.
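
What a transactor does can be sketched in a few lines. This is an illustration under invented assumptions: the valid/ready handshake, the ToyRtlSlave class, and its wait states are hypothetical stand-ins for a real bus protocol and a real RTL block evaluated once per clock. The transactor turns one TLM-style read or write call into as many clock cycles of pin-level activity as the slave needs.

```python
class ToyRtlSlave:
    """Stand-in for an RTL block, evaluated once per clock edge."""

    def __init__(self, wait_states):
        self.wait_states = wait_states
        self.mem = {}
        self._countdown = None  # None means no transfer in progress

    def clock(self, valid, write, addr, wdata):
        """Sample the input pins for one cycle; return (ready, rdata)."""
        if not valid:
            self._countdown = None
            return (0, 0)
        if self._countdown is None:
            self._countdown = self.wait_states
        if self._countdown > 0:
            self._countdown -= 1
            return (0, 0)           # not ready yet, master must hold its pins
        self._countdown = None
        if write:
            self.mem[addr] = wdata
            return (1, 0)
        return (1, self.mem.get(addr, 0))


class Transactor:
    """Converts single TLM calls into clocked pin-level handshakes."""

    def __init__(self, rtl):
        self.rtl = rtl
        self.cycles = 0             # clock cycles spent on the RTL side

    def _transfer(self, write, addr, wdata=0):
        while True:
            self.cycles += 1
            ready, rdata = self.rtl.clock(1, write, addr, wdata)
            if ready:
                return rdata

    def write(self, addr, data):
        self._transfer(1, addr, data)

    def read(self, addr):
        return self._transfer(0, addr)


if __name__ == "__main__":
    xtor = Transactor(ToyRtlSlave(wait_states=2))
    xtor.write(0x10, 0xAB)          # one TLM call, several RTL clock cycles
    print(hex(xtor.read(0x10)), "after", xtor.cycles, "cycles")
```

The virtual platform side sees only the two function calls; the cycle counter shows the extra work that happens on the bit-and-clock-level side.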

Integrating Environment Simulators
Complete simulation system/digital twin (diagram):
• System being designed: control computer in Simics®, running the control application on the target OS, with IO devices
• Simulation of the system mechanics, electronics, physics, …: actuator simulation and sensor simulation
• Simulation of the world in which the system operates
Different simulators used for different parts of the complete simulation

Putting it all together...
Software stack: user programs, middleware, operating system, hardware drivers, UEFI/BIOS/boot code, firmware
Simics heterogeneous target system model, in Wind River Simics® (diagram):
• Processors: Simics ISS, other ISS
• Memory and storage: RAM, Flash, Disk
• Simics device models: DML, Python, C/C++, SystemC TLM
• IO devices, PCIe
• Subsystem with internal ISS and firmware
• Subsystem model in other framework
• SystemC* TLM system w/ ISS
• Detailed architecture model, SystemC detailed model
• Physical systems model, with sensor and actuator
• Transactors (Xtor) to RTL simulator, FPGA prototype, big-box emulator (entire chip)
• Real hardware
https://software.intel.com/en-us/blogs/2017/07/10/building-virtual-platforms-for-integration

How critical is Cycle-level simulation?
Typical paper:
• Simulate processor at “cycle-accurate” level to get reviewer acceptance
• … but does that actually add anything useful to the paper?
Academic cycle-accurate simulators
• No relationship to any real cores
• “Accurate” only possible when there is something real to compare to
Maybe instead…
Capture important effects
• Counts of memory accesses, operations, events, …
• Model only what is changed/studied in detail, leave the rest simplified
• Carefully analyze and motivate what is important (and what is just noise)
Goal: enable evaluation of the proposed approach vs other approaches to the same problem
• != cycle count on a fictional processor core
https://jakob.engbloms.se/archives/2321

Modeling Accelerators
Graphics, imaging, neural networks, crypto, packet processing…
• Many small processing engines (PE)
• Specialized instructions and semantics
Simulate on general-purpose cores
• Naïve solution is very slow
• We need it for full-system software runs (like a video chat applying fun filters to a camera image before sending it out over the network as a video stream…)
How to get results in reasonable time?
• Abstraction? Approximation?
• Parallelization? Use GPUs or Accelerators?
Diagram: an accelerator made of a grid of PEs, simulated on a host with four cores

Efficient Creation of Efficient Models (Modeling)
Typical virtual platform model library:
• 20 standard buses
• 50 processor cores (10 instruction set architectures)
• 1000+ device models
• 100+ platform configurations
Most important problem:
• Efficiently building platforms and devices
• (Cores and buses are very reusable)
Diagram: board design, platform config, memory map, processor core model, devices

Feedback: closed-loop power and energy modeling
Power management is a core part of all modern hardware platforms
• Closed-loop feedback control
• Measurements affect future execution path
Run in virtual platform:
• Model sensors and actuators in hardware, power modes, …: easy…
Need time to get reasonable results
• Energy and thermal: time * power
• Power mode & gating change time
• How to run quickly with useful accuracy?
Stack (diagram): application, operating system, driver, firmware; main processor, power management unit; hardware sensors and actuators
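
The closed loop can be sketched with a toy model. All numbers and names below are hypothetical (power and speed per mode, the first-order thermal model, the throttling threshold); the sketch only shows the structure: energy accumulates as power times time, the temperature sensor feeds the power-mode decision, and the chosen mode changes how long the workload takes.

```python
POWER_W = {"fast": 10.0, "slow": 2.0}   # hypothetical power per mode (watts)
SPEED = {"fast": 2.0, "slow": 1.0}      # hypothetical work units per second
AMBIENT = 40.0                          # degrees C
HEAT_PER_WATT = 1.0                     # heating rate, degrees C/s per watt
COOLING = 0.1                           # fraction of excess heat lost per second


def run_workload(work, throttle_temp, dt=0.5):
    """Run `work` units to completion; return (elapsed_time, energy)."""
    temp = AMBIENT
    done = elapsed = energy = 0.0
    while done < work:
        # Sensor reading drives the power-management decision (closed loop).
        mode = "slow" if temp >= throttle_temp else "fast"
        power = POWER_W[mode]
        done += SPEED[mode] * dt        # the power mode changes execution time
        energy += power * dt            # energy = sum of power * time
        elapsed += dt
        # Simple first-order thermal model: heating minus cooling.
        temp += (HEAT_PER_WATT * power - COOLING * (temp - AMBIENT)) * dt
    return elapsed, energy


if __name__ == "__main__":
    print("no throttling:  ", run_workload(20.0, throttle_temp=1000.0))
    print("throttle at 60C:", run_workload(20.0, throttle_temp=60.0))
```

Because the result depends on how temperature evolves over simulated time, the run has to cover enough target time for the thermal effects to show up, which is the speed-versus-accuracy question the slide raises.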

Transaction-Level Model Equivalence to RTL
TLM models are developed pre-RTL.
• Can be used to test and develop the specification.
• Cannot be derived from RTL directly.
Both are built from the same inputs (diagram): register specification, device interface design, functional specification
• Transaction-Level Modeling (TLM) device model
• Register-Transfer Level (RTL) implementation of the device, which goes to the fab
How to prove that these two are equivalent? What does equivalence mean?

Don’t Forget – PhD Life is a Good Life!
Phoenix, AZ, for the Real-Time Systems Symposium (RTSS) 1999

Legal Notice
Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others
© Intel Corporation.