Jakob Engblom, PhD, , Stockholm, Sweden

My Background

Jakob Engblom
. MSc, Computer Science, Uppsala
. PhD, Real-Time Systems, Uppsala

Currently:
. Product Management Engineer, Core team, at Intel in Stockholm, Sweden
. Evangelist – Simulation

Previously:
. IAR Systems, , Wind River
. Product management, product marketing, technical sales, technical marketing, business development, training development, demos, ...

Blog at:
. https://software.intel.com/en-us/meet-the-developers/evangelists/team/jakob-engblom

My own blog, since 2007:
. https://jakob.engbloms.se
. https://www.engbloms.se/jakob.html

Very rarely touch actual hardware when doing development. Very rarely.

Copyright Intel 2019 | SAMOS 2019

What Does Intel Do?

• Intel® Core® • Intel® ® • SSD • Ethernet • Intel® ™ • Processors • 3D XPoint™ • WiFi • Chipsets • Chipsets • Intel® Optane™ • Bluetooth • Smart NICs • GNSS

Laptop and desktop, Server, Storage, Connectivity

• SoC-FPGA • Movidius • Processors • Development tools • FPGA • Nervana • Gateways • Compilers • FPGA-CPU combo • • Security • Simulation solutions • Intel® Xeon® • Management • & Windows drivers • UEFI & BIOS

FPGA, AI and ML, IoT, Software

Hardware: A Hard Development Platform?

Hardware is Hard When it is...

. Not yet available
. Flaky prototype stage
. Not available anymore

Hardware is Hard When it is...

. Inconveniently large & complex
. Dangerous to play with
. Inaccessible & expensive

Solution: [Fast] Virtual platform

Full-system virtual platform
. Simulated target hardware
. Real software, same as on the hardware
. Fast enough to run complete workloads*

“Free developers from hardware”

* Speed depends on model abstraction level...

[Figure: layered stack, top to bottom. Apps: user-level application code. SDKs, libraries, middleware, …. OS: operating system. HW: virtual/simulated target hardware, with network. Virtual platform, like Wind River Simics®. Host operating system. Host hardware.]

Hardware Not Yet Available: Shift-Left

[Figure: two workflows along a time axis.
Traditional workflow: hardware design and production, followed by hardware-dependent software development and hardware/software integration and test.
Shifting left using virtual platforms: with a virtual platform, software development and testing (hardware-dependent software development, hardware/software integration and test) shift left and overlap with hardware design and production.]

Note: Shift Left Applies at Multiple Levels

SILICON VENDOR CHIPS
. Architecture
. Hardware validation
. SoC integration
. Firmware
. Boot code
. Drivers
. Operating System (OS) support
. Compilers
. Software Development Kits (SDKs)
. Frameworks
. Application optimization & porting
. ...

(CUSTOM) BOARD
. Board integration
. Boot code
. Drivers
. Self-test & fault tolerance
. Additional OS support
. Real-time operating system (RTOS)
. Board-level SDK
. Manageability features
. Applications
. …

OEM PRODUCT
. System architecture
. Full-system software
. Integration with physical designs
. “Digital Twin”
. Management & resiliency
. Manufacturing tests
. Deployment and tracking
. Control software
. Applications
. …

Typically, this is the customer of the silicon vendor!

(Computer) Architecture with software in the loop

Examples:
. Processor, pipeline, cache design
. New instructions & execution modes
. Hardware accelerator design
. Hardware-software interface design
. Hardware-software codesign & optimization

[Figure: design loop. Design / architecture → build model → virtual platform & architecture model running the software workload → performance, time, power, statistics, ... → update design & model, update software.]

This is inside a silicon vendor, before chips are manufactured

Hardware Validation and Preparation of Tests

RTL = Register Transfer Level
. VHDL, Verilog, etc.

Validate the actual implementation before “tape-in” to the chip fab

Use virtual platform to test RTL in a system context
. Run real software loads
. Run validation software in pre-silicon
. Develop and test post-silicon tests

[Figure: test software running on a virtual platform connected to the RTL implementation]

This is inside the silicon vendor

Testing large-scale networks

Simulate the server in the same way as the other nodes, or connect to a real-world server

[Figure: simulated network inside Wind River Simics®, running on the host OS and host hardware.
. Wireless network node: sensing, actuating, communications; small OS on simulated HW with IO and radio
. Gateway: communication, mesh management, edge analytics; RTOS or Linux on simulated HW with radio and LAN
. Cloud server
. Wireless network: simulation of wireless network conditions
. Simulation of the world]

Example of work done by an OEM company building actual nodes and systems

https://software.intel.com/en-us/blogs/2018/04/11/1000-machines-in-a-simulation

Workload Bring-Up and software validation

. Update software stack to use latest hardware instruction sets and features
. Ensure that the integration of hardware, boot code, drivers, OS, and applications works before the silicon arrives

[Figure: two simulated machines connected by a network, inside Wind River Simics®.
. Future Server Platform 1 – Database server: database program, user land, Linux* distribution, UEFI (Unified Extensible Firmware Interface); processor socket 1 and socket 2 with multiple cores, PCH, 2 × 96GB RAM, disk, 10G Eth
. Future Server Platform 2 – App server: SpecJEnterprise driver utility, application server (payload), Java* Virtual Machine (JVM), user land, Linux* distribution, UEFI; processor socket 1 and socket 2 with multiple cores, PCH, 2 × 96GB RAM, disk, 10G Eth
. Disk image contents: OS + user software]

This particular example: silicon vendor + software vendor cooperating on next-gen hardware tuning

https://software.intel.com/en-us/blogs/2018/03/15/software-on-wind-river-simics-virtual-platforms-then-and-now

System-level Debug example: Simics-on-Simics

Bug only hits when the file is on an NFS server, and we have at least 2 cores in the Simics host: Intel concurrency necessary

Debug loop: Reproduce, Repeat, Reverse, Analyze

[Figure: setup inside the (Outer) Wind River Simics®.
. Simulated server hardware running the host OS (SUSE Linux 11) with the VMXMON driver. On it runs the (Inner) Simics with a device model working with the file, coordinated with the external simulator, and an external test program (stand-in for internal simulator). Both mmap() the file.
. Network
. NFS server: server hardware, host OS, file on disk]

This kind of work happens at all users of virtual platforms

https://software.intel.com/en-us/blogs/2016/05/30/finding-kernel-1-2-3-bug-running-wind-river-simics-simics

A melting pot of traditions

1950s “Software” (Fast functional, Virtual machines, …)

1960s “Computer architecture” (Cycle accurate, cache models, …)

Current virtual platforms / digital twins

1990s “Hardware designers” (RTL, SystemC*, FPGAs, Emulators, …)

1940s “Mechanical/physics modeling” (Matlab*, FORTRAN*, ..)

What is in a virtual platform?

User interface

Simulator infrastructure and features

Virtual platform API

Processor core models

Device models

Buses and interconnects

Target system

Virtual platform, such as Wind River Simics®

How to build a fast virtual platform

Fast Instruction-Set Simulator (ISS)
. Functional abstraction level
. Just-in-time compilation (JIT)
. Virtualization
. Simplified timing
. Temporal decoupling
. …

Fast Device Models
. Transaction-Level Modeling (TLM)
. Event-driven simulation
. Simplified timing
. …

Efficient Framework
. Reduce overheads
. Multithreading
. Optimize, optimize, optimize
. …

Tailored Configurations
. Configurations optimized for each use case
. Highest-possible level of abstraction
. …
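A minimal sketch of temporal decoupling, one of the ISS techniques listed on this slide. It is not Simics code: the function name, the round-robin scheduling, and the cycle counts are assumptions made for illustration. Each simulated core runs a whole time quantum before the simulator switches to the next core, which trades timing skew between cores for fewer switches.

```python
def run_temporally_decoupled(num_cores, total_cycles, quantum):
    """Round-robin the simulated cores, letting each run `quantum` cycles at a time.

    Returns (switches, max_skew): how many times the simulator switched
    between cores, and the largest difference seen between the local
    clocks of any two cores.
    """
    local_time = [0] * num_cores
    switches = 0
    max_skew = 0
    while min(local_time) < total_cycles:
        for core in range(num_cores):
            step = min(quantum, total_cycles - local_time[core])
            local_time[core] += step  # stand-in for executing `step` target cycles
            switches += 1
            max_skew = max(max_skew, max(local_time) - min(local_time))
    return switches, max_skew


if __name__ == "__main__":
    for quantum in (1, 100):
        print(quantum, run_temporally_decoupled(2, 1000, quantum))
```

With two cores and 1,000 cycles, a quantum of 1 needs 2,000 switches, while a quantum of 100 needs 20 switches at the price of up to 100 cycles of skew between the cores.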

Instruction-Set Simulation Techniques

[Figure: three techniques, each running the target on the virtual platform on the host.
. Interpreter
. JIT compiler, with fall back to the interpreter
. Virtualization, with fall back to the JIT compiler]
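A toy sketch of the interpreter and JIT-with-fall-back idea, assuming a made-up three-instruction target (`addi`, `jnz`, `halt`). The "JIT" here only fuses runs of `addi` into one host function and caches the result by start address; anything else falls back to the interpreter. Real JIT compilers emit host machine code, and virtualization (running target code directly on the host processor) is not shown.

```python
def interpret_one(program, state):
    """Interpreter: decode and execute a single instruction (always works, slow)."""
    op = program[state["pc"]]
    if op[0] == "addi":  # addi reg, imm
        state["regs"][op[1]] += op[2]
        state["pc"] += 1
    elif op[0] == "jnz":  # jnz reg, target
        state["pc"] = op[2] if state["regs"][op[1]] != 0 else state["pc"] + 1
    elif op[0] == "halt":
        state["halted"] = True
    else:
        raise ValueError(f"unknown instruction {op!r}")


def translate_block(program, start):
    """Toy 'JIT': fuse a run of addi instructions into one host function.

    Returns None when nothing can be translated at `start`, so the
    caller falls back to the interpreter.
    """
    pc = start
    deltas = {}
    while pc < len(program) and program[pc][0] == "addi":
        _, reg, imm = program[pc]
        deltas[reg] = deltas.get(reg, 0) + imm
        pc += 1
    if pc == start:
        return None
    end = pc

    def block(state):
        for reg, delta in deltas.items():
            state["regs"][reg] += delta
        state["pc"] = end

    return block


def run(program, num_regs=4):
    state = {"pc": 0, "regs": [0] * num_regs, "halted": False}
    cache = {}  # start pc -> translated block, or None if untranslatable
    stats = {"translated": 0, "interpreted": 0}
    while not state["halted"]:
        pc = state["pc"]
        if pc not in cache:
            cache[pc] = translate_block(program, pc)
        if cache[pc] is not None:
            cache[pc](state)
            stats["translated"] += 1
        else:
            interpret_one(program, state)  # fall back
            stats["interpreted"] += 1
    return state["regs"], stats


PROGRAM = [
    ("addi", 0, 3),   # r0 = loop counter
    ("addi", 1, 2),   # loop body starts here (pc 1)
    ("addi", 1, 3),
    ("addi", 0, -1),
    ("jnz", 0, 1),    # loop while r0 != 0
    ("halt",),
]

if __name__ == "__main__":
    print(run(PROGRAM))
```

The translation cache is what makes the approach pay off: the loop body is translated once and then reused on every iteration.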

Processor and system model detail vs speed

(for a typical processor-based system)

Slowdown by level of model detail:
. Functionality: 1-5x
. Memory latency: 2-10x
. Caches, branch prediction: 10-100x
. Pipeline, OOO, superscalar, … (full microarchitecture model): 10,000-100,000x
. RTL on simulator: 1,000,000x – 10,000,000x or worse

Slowdown effects:
. 10x slowdown = interactively useful
. 100x slowdown = over-night runs
. 100,000x slowdown = 70 days to run 1 minute

Note on RTL – it is very slow and very hard to use for architecture exploration
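The slowdown figures on this slide are plain multiplication: host time = target time × slowdown. A small sketch of the arithmetic, with helper names that are illustrative only:

```python
def host_seconds(target_seconds, slowdown):
    """Host wall-clock time needed to simulate `target_seconds` of target time."""
    return target_seconds * slowdown


def describe(seconds):
    """Format a duration in the largest convenient unit."""
    if seconds < 3600:
        return f"{seconds / 60:.1f} minutes"
    if seconds < 86400:
        return f"{seconds / 3600:.1f} hours"
    return f"{seconds / 86400:.1f} days"


if __name__ == "__main__":
    for slowdown in (10, 100, 100_000):
        print(f"{slowdown}x:", describe(host_seconds(60, slowdown)))
```

One minute of target time at 100,000x is 6,000,000 host seconds, or about 69.4 days, which is the "70 days" on the slide.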

Building a Transaction-Level Model (functional)

Inputs to a fast functional/Transaction-Level model (TLM):
. Register specification (registers: configuration, control, commands, buffers, …): generate
. Device interface design (interfaces: bus, interrupts, reset, power, …): coding/generate
. Functional specification (functional behavior): coding

Register specifications come from hardware designers. If at all possible, get the register specs in a machine-readable format that can be used to generate the register code.

Functional specifications are basically the same information that goes into programming manuals.
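A minimal sketch of generating the register code from a machine-readable specification and attaching hand-written functional behavior. The spec format, the device, the register names, and the `on_write_<NAME>` hook convention are all hypothetical; they do not describe any real device or the actual Simics modeling flow.

```python
# Hypothetical machine-readable register specification for a toy device.
REGISTER_SPEC = [
    {"name": "CTRL", "offset": 0x00, "reset": 0x0, "access": "rw"},
    {"name": "STATUS", "offset": 0x04, "reset": 0x1, "access": "ro"},
    {"name": "DATA", "offset": 0x08, "reset": 0x0, "access": "rw"},
]


class RegisterBank:
    """Register layer built entirely from the spec (the 'generated' part)."""

    def __init__(self, spec):
        self._by_offset = {reg["offset"]: reg for reg in spec}
        self.values = {}
        self.reset()

    def reset(self):
        self.values = {reg["name"]: reg["reset"] for reg in self._by_offset.values()}

    def read(self, offset):
        reg = self._by_offset.get(offset)
        if reg is None:
            return 0  # assumption: unmapped reads return 0
        return self.values[reg["name"]]

    def write(self, offset, value):
        reg = self._by_offset.get(offset)
        if reg is None or reg["access"] == "ro":
            return  # assumption: such writes are silently ignored
        self.values[reg["name"]] = value & 0xFFFFFFFF
        hook = getattr(self, "on_write_" + reg["name"], None)
        if hook is not None:
            hook(value)  # functional behavior attaches here


class ToyDevice(RegisterBank):
    """Hand-coded functional behavior (the 'coding' part)."""

    def on_write_CTRL(self, value):
        if value & 1:  # writing bit 0 of CTRL clears STATUS
            self.values["STATUS"] = 0


if __name__ == "__main__":
    dev = ToyDevice(REGISTER_SPEC)
    dev.write(0x00, 1)
    print(hex(dev.read(0x04)))
```

Keeping the register layer data-driven means a change in the hardware designers' spec only requires regenerating, while the behavior code stays untouched.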

Black Boxes, White Boxes and Abstraction Levels

In a virtual platform, you typically find three types of device/subsystem models:

. TLM black-box model [including “stubs” and dummies]: “functional” abstraction level
. TLM white box: firmware running on an ISS, together with TLM models; “functional” abstraction level
. Detailed architecture/design model: firmware running on a detailed ISS, together with detailed models

The design model is used to design and validate the actual hardware design – it is a bus-and-cycle level model. 10,000x-100,000x slowdown.

Multithreading a simulator

Note that threading should be applied with care – debug is hard, tuning is complex, correctness tricky

[Figure: four ways to thread a virtual platform running a target.
. Thread across long-latency networks: separate targets connected by a network
. Thread between processor cores: processor cores sharing memory and devices
. Expensive subsystem on a thread: a complex subsystem next to the processor cores, memory, and devices
. Thread to interact with asynchronous world: an interface between the target and external software]

Architecture: Gear-shift to combine fast and detailed

Drop into detailed mode at interesting points, using checkpoints or direct conversion

Use fast virtual platform to quickly get to interesting points in the workload
. Detailed simulation would be too slow for the complete run
. Switch over to detailed simulation at interesting points in the workload
. Sampling – only run parts of workload in the detailed simulator

Detailed runs can execute in parallel for throughput
– How to do this best is a research topic

Common in research tools: GEM5, old Simics-GEMS, …

[Figure: a fast functional simulation timeline with three branches, each into “Warming” followed by “Full details”]

https://jakob.engbloms.se/archives/2514
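A back-of-the-envelope sketch of why sampling pays off. The function names and all speeds (target instructions per host second in fast, warming, and detailed mode) are assumed numbers for illustration, not measurements of any tool.

```python
def sampled_host_seconds(total, period, warm, detail,
                         fast_ips, warm_ips, detailed_ips):
    """Host time for a gear-shifted run over `total` target instructions.

    Once per `period` instructions the run leaves fast mode for `warm`
    instructions of warming followed by `detail` instructions in full
    detail; everything else is fast-forwarded.
    """
    if warm + detail > period:
        raise ValueError("warm + detail must fit inside one period")
    samples = total // period
    fast = total - samples * (warm + detail)
    return (fast / fast_ips
            + samples * warm / warm_ips
            + samples * detail / detailed_ips)


def fully_detailed_host_seconds(total, detailed_ips):
    """Host time if the whole workload runs in the detailed simulator."""
    return total / detailed_ips


if __name__ == "__main__":
    args = dict(total=1_000_000_000, period=100_000_000,
                warm=1_000_000, detail=1_000_000,
                fast_ips=1_000_000_000, warm_ips=10_000_000,
                detailed_ips=10_000)
    print(sampled_host_seconds(**args))
    print(fully_detailed_host_seconds(args["total"], args["detailed_ips"]))
```

Under these assumptions the sampled run takes about 1,002 host seconds against 100,000 for the fully detailed run, and nearly all of the sampled time is spent in the detailed windows, which is why running them in parallel helps throughput.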

Architecture: Mixing Abstraction Levels in one setup

. Evaluate the efficiency of the software/hardware interfaces of the accelerator
. Evaluate the performance of the accelerators under real workloads
. Network traffic generation inside or outside of Simics

[Figure: Simics target system model in Wind River Simics®. Software: benchmark, traffic generator, real-world application, …; target operating system with device driver; firmware. Target machine: Core, Core, RAM, Disk, APIC, FLASH, Ethernet, USB, Serial, GPU, and a detailed model of the accelerator block (design/architecture model of the accelerator block). Network with traffic generation.]

Hybrid platforms – validation

Mixing virtual platform models and RTL implementation of parts of a system

. Test actual RTL with software, test software with actual RTL, run hardware validation

. VP provides part of the system, to reduce requirements on hardware resources and increase speed

. Most valuable part to have in VP: processor cores. ISS much faster than putting cores in emulation/prototype

. Transactors convert between fast TLM communication and bit-and-clock-level RTL communication

The IP or set of IPs running in RTL varies across the life cycle and the target of the validation effort: single IP, cluster, or chip.

Sometimes, the picture is inverted and the virtual platform provides the PCH and core and uncore are on the RTL side.

[Figure: virtual platform model (user-level test code, operating system, UEFI, bare-metal test code; Core, Core, APIC, Timer, Eth, Disk, Serial, RAM) connected through transactors to RTL in FPGA / Emulator / Simulator. The transactors are the connection to move transactions to/from the detailed system.]
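A toy sketch of what a transactor does: it takes one TLM-style transaction and drives a clocked, pin-level interface cycle by cycle until the other side signals ready. The bus protocol (valid/ready handshake), the `ToyRtlSlave` stand-in, and all names are assumptions for illustration; a real transactor talks to an RTL simulator, emulator, or FPGA.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    address: int
    data: int
    is_write: bool


class ToyRtlSlave:
    """Stand-in for RTL: a clocked memory that needs `latency` cycles per access."""

    def __init__(self, latency=2):
        self.mem = {}
        self.latency = latency
        self._countdown = None

    def clock(self, valid, write, addr, wdata):
        """Advance one clock cycle with the given pin values; returns (ready, rdata)."""
        if not valid:
            self._countdown = None
            return (False, 0)
        if self._countdown is None:
            self._countdown = self.latency
        self._countdown -= 1
        if self._countdown > 0:
            return (False, 0)
        self._countdown = None
        if write:
            self.mem[addr] = wdata
            return (True, 0)
        return (True, self.mem.get(addr, 0))


class Transactor:
    """Converts a single function-call transaction into per-cycle pin activity."""

    def __init__(self, rtl):
        self.rtl = rtl
        self.cycles = 0  # clock cycles consumed on the RTL side

    def transport(self, txn):
        while True:  # hold valid high until the slave reports ready
            ready, rdata = self.rtl.clock(True, txn.is_write, txn.address, txn.data)
            self.cycles += 1
            if ready:
                break
        self.rtl.clock(False, False, 0, 0)  # one idle cycle with valid low
        self.cycles += 1
        if not txn.is_write:
            txn.data = rdata
        return txn


if __name__ == "__main__":
    xtor = Transactor(ToyRtlSlave(latency=2))
    xtor.transport(Transaction(0x10, 0xAB, True))
    print(hex(xtor.transport(Transaction(0x10, 0, False)).data), xtor.cycles)
```

On the virtual platform side each access is one call; on the RTL side the same access costs several clock cycles, which is the speed gap the hybrid setup has to manage.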

Integrating Environment Simulators

. System being designed: control computer (Simics®) running the control application on the target OS, with IO devices
. Simulation of the system mechanics, electronics, physics, …: actuator simulation and sensor simulation
. Simulation of the world in which the system operates

Complete simulation system/digital twin

Different simulators used for different parts of the complete simulation

Putting it all together...

[Figure: Simics heterogeneous target system model, running in Wind River Simics®.
. Software: user programs, middleware, operating system, hardware drivers, UEFI/BIOS/boot code, firmware
. Models: Simics ISS, other ISS, SystemC* TLM system with ISS, RAM, flash, disk, IO devices, subsystem with internal ISS and firmware, subsystem in other framework, physical systems model with sensor and actuator
. Modeling languages and styles: TLM model in Simics, DML, Python, C/C++ device model, SystemC TLM, SystemC detailed architecture model
. External connections: transactors (Xtor), PCIe; RTL simulator, FPGA prototype, big-box emulator; entire chip; real hardware]

https://software.intel.com/en-us/blogs/2017/07/10/building-virtual-platforms-for-integration

How critical is Cycle-level simulation?

Typical paper:
. Simulate processor at “cycle-accurate” level to get reviewer acceptance
. … but does that actually add anything useful to the paper?

Academic cycle-accurate simulators
. No relationship to any real cores
. “Accurate” only possible when there is something real to compare to

Maybe instead…

Capture important effects
. Counts of memory accesses, operations, events, …
. Model only what is changed/studied in detail, leave the rest simplified
. Carefully analyze and motivate what is important (and what is just noise)

Goal: enable evaluation of the proposed approach vs other approaches to the same problem
. != cycle count on a fictional processor core

https://jakob.engbloms.se/archives/2321

Modeling Accelerators

Graphics, imaging, neural networks, crypto, packet processing…
. Many small processing engines (PE)
. Specialized instructions and semantics

Simulate on general-purpose cores
. Naïve solution is very slow
. We need it for full-system software runs – (like a video chat applying fun filters to a camera image before sending it out over the network as a video stream… )

How to get results in reasonable time?
. Abstraction? Approximation?
. Parallelization? Use GPUs or Accelerators?

[Figure: an accelerator built as a grid of PEs, simulated on a host with four general-purpose cores]

Efficient Creation of Efficient Models (Modeling)

Typical virtual platform model library:
. 20 standard buses
. 50 processor cores (10 instruction set architectures)
. 1000+ device models
. 100+ platform configurations

Most important problem:
. Efficiently building platforms and devices
. (Cores and buses are very reusable)

[Figure: a board design feeding a platform config with a processor core model, a memory map, and many devices]

FEEDBACK closed-loop Power and energy modeling

Power management is a core part of all modern hardware platforms
. Closed-loop feedback control
. Measurements affect future execution path

Run in virtual platform:
. Model sensors and actuators in hardware, power modes, … - easy…

Need time to get reasonable results
. Energy and thermal: time * power
. Power mode & gating change time
. How to run quickly with useful accuracy?

[Figure: stack of application, operating system, and driver on the main processor; firmware on the power management unit; hardware sensors and actuators]
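A toy sketch of the closed loop described on this slide: a temperature sensor reading selects the power mode, the power mode changes both power draw and how fast work completes, and energy accumulates as time × power. The power modes, thermal coefficients, and temperature limits are invented numbers, not a model of any real platform.

```python
# Assumed power modes: power draw in watts and work rate in units per second.
MODES = {
    "fast": {"power_w": 20.0, "units_per_s": 2.0},
    "slow": {"power_w": 5.0, "units_per_s": 1.0},
}


def run_workload(work_units, temp_limit_c, dt_s=1.0):
    """Run `work_units` of work under a thermal governor.

    Returns (elapsed_s, energy_j).
    """
    temp_c = ambient_c = 40.0
    done = energy_j = elapsed_s = 0.0
    while done < work_units:
        # Feedback: the sensor reading decides the power mode for this step.
        mode = MODES["slow" if temp_c >= temp_limit_c else "fast"]
        done += mode["units_per_s"] * dt_s
        energy_j += mode["power_w"] * dt_s  # energy = time * power
        elapsed_s += dt_s
        # Toy thermal model: heat up with power, cool towards ambient.
        temp_c += 0.5 * mode["power_w"] * dt_s - 0.1 * (temp_c - ambient_c) * dt_s
    return elapsed_s, energy_j


if __name__ == "__main__":
    print("no throttling:", run_workload(20, temp_limit_c=1000.0))
    print("throttling at 60 C:", run_workload(20, temp_limit_c=60.0))
```

In this toy setup the throttled run takes 17 s instead of 10 s. Because the measurement changes the execution path, timing errors in the simulation feed back into which power modes are taken, which is what makes "useful accuracy" hard to define.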

Transaction-Level Model Equivalence to RTL

TLM models are developed pre-RTL.
. Can be used to test and develop the specification.
. Cannot be derived from RTL directly.

How to prove that these two are equivalent? What does equivalence mean?

[Figure: register specification, device interface design, and functional specification feed both the Transaction-Level Modeling (TLM) device model and the Register-Transfer Level (RTL) implementation of the device, which goes to the fab; an “=” between the two marks the equivalence question]

Don’t Forget – PhD Life is a Good Life!

Phoenix, AZ, for the Real-Time Systems Symposium (RTSS) 1999

Legal Notice

Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others

© Intel Corporation.
