
Computer Architecture and Assembly Language

Gabriel Laskar

EPITA

2015

License

- Copyright © 2004-2005, ACU, Benoit Perrot
- Copyright © 2004-2008, Alexandre Becoulet
- Copyright © 2009-2013, Nicolas Pouillon
- Copyright © 2014, Joël Porquet
- Copyright © 2015, Gabriel Laskar

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being just "Copying this document", no Front-Cover Texts, and no Back-Cover Texts.

Part I

Introduction

1: Introduction

Problem definition

Outline

What are we trying to learn? Architecture

What is in the hardware?

- A bit of history of computers, current machines
- Concepts and conventions: processing, memory, communication, optimization

How does a machine run code?

- Program execution model
- Memory mapping, OS support

What are we trying to learn? Assembly Language

How to "talk" with the machine directly?

- Mechanisms involved
- Assembly language structure and usage
- Low-level language features
- C inline assembly


Who do I talk to?

- System gurus
- Low-level enthusiasts
- Programmers at any level (C/C++, Objective-C, C#, Java, JS/AS, etc.)
- Wise managers

Course outline

- Processor architecture
- Memory
- Memory mapping
- Execution flow
- Object file formats
- Assembly programming
- Focus on RISC processors
- CPU-aware optimizations
- Multi-/Many-core, heterogeneous systems

Part II

Processor architecture

2: Processor architecture

Overview

Inside the processor

Processor units

Instructions

Instruction flow

Pipeline processor

CISC & RISC architectures


What a processor is...

A processor must be able to perform the following basic tasks:

- Execute instructions
- Read operands
- Store results

It needs several basic units to perform those tasks:

- A control unit
- An arithmetic and logical unit (ALU)
- A register bank

Let’s design it!

Basic architecture

[Diagram: register bank (R0-R5), ALU and control unit]


Basic architecture (2)

In this model, the system state is entirely contained in the processor.

- This might be sufficient for a very basic processor
- More features could be provided by adding registers or program steps

Unfortunately:

- Internal memory is expensive and hard to design
- There is no communication with the outside
- Updating the program may not be easy

We need access to memory, external devices, etc.


Revised processor model

A processor must be able to perform the following basic tasks:

- Fetch instructions from an external entity and understand them (fetch and decode)
- Execute instructions
- Store results to registers or external memory

It needs several basic units to perform those tasks:

- A control unit
- An arithmetic and logical unit (ALU)
- A register bank
- A program memory access
- A data memory access

Revised processor model (2)

[Diagram: register bank, ALU and control unit, now with an instruction port (address out, code in) and a data port (command, address, read/write data)]

Processor physical layout

A processor is composed of many different units:

- Caches, MMU
- Integer unit, Control unit, Floating-point unit

Each unit is:

- implemented as a hardware component
- made of switchable parts (transistors)

In old processors:

- Units used to be independent chips
- Some were even optional "coprocessors"

Today, processors are embedded on a single chip.

Inside the processor

Processor package

Processor physical layout

Processor physical layout: transistor details

Processor units

Units

- The control unit fetches and decodes the instruction
- The registers give the data
- The ALU implements the operation
- Some instructions access external data

[Diagram: register bank, ALU and control unit, with instruction and data memory ports]

Registers

Registers may be seen as variables located inside the processor:

- 8, 16, 32, 64, 128, ... bits wide
- General-purpose registers:
  - integer
  - floating point (float, double)
- Specialized registers:
  - flags: Zero, Negative, Carry, Overflow, etc.
  - system: Mode, IRQ masking, etc.

ALU: Arithmetic and Logical Unit

A unit without registers!

- Logical operations: AND, OR, XOR, NOT, NOR
- Arithmetic operations: addition, subtraction, multiplication, division
- Shifts
- Compares

Division is not possible without registers!

Instructions

Instruction types

There are different instruction types:

- Arithmetic and logical operations
- Control instructions
- Memory access instructions

Instruction classification

Flynn's taxonomy:

- SISD: Single Instruction, Single Data (classical computers)
- SIMD: Single Instruction, Multiple Data (vector computers)
- MIMD: Multiple Instruction, Multiple Data (multiprocessor computers)

Instruction flow

Microprogrammed processor: instruction flow

Processors may use a finite state machine or micro-ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.).

A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one:

sub: Fetch, Decod, Exec, Write
jmp: Fetch, Decod, Exec
ld:  Fetch, Decod, Exec, Mem, Write

(each instruction only starts once the previous one has completely finished)


Microprogrammed processor: pros/cons

Cons:

- Slow
- The more complex the instructions are, the longer they take to get processed
- Most of the hardware is used only once for each instruction
- Most of the hardware is unused most of the time

Pros:

- Easy to implement
- Small
- New instructions can be added just by adding new steps

Pipeline processor

Pipelined processor: instruction flow

A pipeline architecture enables the parallel execution of several instructions:

- Split the execution of each instruction in several steps
- Each step performs an elementary operation
- Each step is associated with a specific part of the hardware
- All parts of the hardware work in parallel

Stages: Fetch, Decode, Exec, Memory, Writeback (instruction memory access during Fetch, data memory access during Memory)

Pipeline flow

Instruction sequence vs. pipeline timeline:

sub  Fetch  Decod  Exec   Mem    Write
not         Fetch  Decod  Exec   Mem    Write
add                Fetch  Decod  Exec   Mem    Write
jmp                       Fetch  Decod  Exec   Mem
sub                              Fetch  Decod  Exec
add                                     Fetch  Decod
?                                              Fetch

Speeding up the pipeline

Current processors extensively use the pipeline architecture to accelerate the execution of instructions.

Because a pipeline architecture works in parallel, the slowest step determines the pipeline cycle time (and working frequency):

Step 1: 10 ns, Step 2: 18 ns, Step 3: 8 ns -> cycle time is 18 ns

Splitting operations into shorter steps enables the processor frequency to increase:

Step 1: 10 ns, Step 2.1: 9 ns, Step 2.2: 9 ns, Step 3: 8 ns -> cycle time drops to 10 ns

CISC & RISC architectures

Instruction-set classification

Based on their internal architecture and instruction formats, processor architectures may be classified in two groups:

- Complex Instruction Set Computer (CISC)
- Reduced Instruction Set Computer (RISC)

Early processor architectures were mostly CISC-based: z80, x86, 68000, etc.

More recent designs are rather RISC-based: MIPS, SPARC, Alpha, PowerPC, ARM, etc.

RISC characteristics

Pros:

- Simple instructions
- Fixed-length instructions
- Decoding instructions requires simple hardware

Cons:

- Programs are longer as they need more instructions
- Optimization is harder; the compiler needs to be smarter

Sometimes jokingly expanded as "Reject Important Stuff Into the Compiler".

RISC instruction example

sub %g1, %g2, %g3 is encoded as the 4 bytes 0x86 0x20 0x40 0x02:

format  destination  opcode  source  unused     source
10      00011        000100  00001   000000000  00010
        %g3          sub     %g1                %g2
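As a rough illustration of the fixed 32-bit format, the fields above can be extracted with shifts and masks. A minimal sketch in C, assuming the SPARC-like field widths shown above (this is not a complete decoder):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* sub %g1, %g2, %g3, stored big-endian as 0x86 0x20 0x40 0x02 */
    uint32_t insn = 0x86204002;

    unsigned format = (insn >> 30) & 0x3;   /* 2-bit format           */
    unsigned rd     = (insn >> 25) & 0x1f;  /* destination register   */
    unsigned op3    = (insn >> 19) & 0x3f;  /* operation code         */
    unsigned rs1    = (insn >> 14) & 0x1f;  /* first source register  */
    unsigned rs2    =  insn        & 0x1f;  /* second source register */

    printf("format=%u rd=%%g%u op3=0x%02x rs1=%%g%u rs2=%%g%u\n",
           format, rd, op3, rs1, rs2);
    return 0;
}

Running it prints format=2 rd=%g3 op3=0x04 rs1=%g1 rs2=%g2, matching the field breakdown above.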

CISC characteristics

Pros:

- Lots of instructions and addressing modes

- A single instruction can perform complex operations
- Assembly programs are easier and shorter to write
- The code compression ratio is good

Cons:

- The binary instruction format has variable length
- It requires more complex hardware, and high frequencies are harder to achieve

Modern processors often internally translate the CISC code to RISC.

CISC instruction example

[x86-style variable-length encoding, with fields such as: opcode, second opcode, mod/r/m, displacement, immediate, memory operand]

Part III

Memory

3: Memory

Memories

Memory accessing modes

Alignment

Endianness

Caches

Memory types: Reasons to access memory

How does the memory work with the processor?

- Memory is used to fetch instructions,
- Memory is used to access data.
- There may be one unique memory, or two:
  - if there is one memory, instruction and data accesses must be serialized
  - if there are two, code cannot be accessed as data

Conventional names:

1. Von Neumann architecture (one memory)
2. Harvard architecture (two memories)

Access examples: Instruction fetch

The processor needs an instruction to process.

[Diagram: the processor sends the instruction address to memory, and the memory returns the instruction code]

Data access: load

ld [%g1], %g2 loads the content of the memory cells pointed to by %g1 into the %g2 register.

[Diagram: the processor sends a read command and the data address to memory, and the memory returns the data]

Data access: store

st %g2, [%g1] stores the content of the %g2 register into the memory cells pointed to by %g1.

[Diagram: the processor sends a write command, the data address and the data to memory]

Memory accessing modes: Immediate addressing

- The value of the data is directly stored in the instruction
- No memory access is needed to get the value

In C language:

int a, b = ...;
a = b + 0x831;

In assembly language:

add %g1, 0x831, %g2

Immediate addressing: SPARC instruction details

sub %g1, 0x831, %g2 is encoded as the 4 bytes 0x84 0x20 0x68 0x31:

format  destination  opcode  source  immediate
10      00010        000100  00001   1 0 1000 0011 0001
        %g2          sub     %g1     0x831

Absolute addressing

- The address of the data is stored in the instruction
- A memory access is needed to get the value

In C language:

int a = *(int *)0x830;

In assembly language:

ld [0x830], %g1

Register indirect addressing

- The address of the data is stored in a register
- A memory access is needed to get the value

In C language:

int a, *b = ...;
a = *b;

In assembly language:

ld [%g2], %g1

Complex addressing

- Register indirect with base register
- Register indirect with offset
- And many others...

Assembly examples (C equivalents are sketched below):

ld [%g2 + 0x124], %g1
ld [%g2 + %g3], %g1
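For intuition, these two addressing forms roughly correspond, in C, to reading a structure field at a fixed offset and reading an array element at a variable index. A minimal sketch (the structure layout and variable names are illustrative, not taken from the slides):

#include <stdio.h>

struct big {
    char pad[0x124];   /* filler so that 'field' sits at offset 0x124 */
    int  field;
};

int main(void)
{
    struct big b = { .field = 42 };
    int v[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    int i = 3;

    int x = b.field;   /* base + constant offset:  ld [%g2 + 0x124], %g1 */
    int y = v[i];      /* base + index register:   ld [%g2 + %g3], %g1   */
                       /* (the index is scaled by sizeof(int) by the compiler) */

    printf("%d %d\n", x, y);
    return 0;
}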

Alignment

Memory access alignment: Definition

Data access alignment depends solely on the width of the accessed data type:

- 32-bit integer accesses are aligned for addresses that are multiples of 4
- 16-bit integer accesses are aligned for even addresses
- 8-bit integer (char) accesses are always aligned!

Think of it as address % sizeof(type) == 0 (see the small check below).
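A minimal sketch of that rule in C, assuming we only want to test whether a pointer is suitably aligned for a given type (the macro name is made up for the example):

#include <stdio.h>
#include <stdint.h>

/* An access is aligned when the address is a multiple of the type width. */
#define IS_ALIGNED(ptr, type) (((uintptr_t)(ptr) % sizeof(type)) == 0)

int main(void)
{
    int  words[2];             /* int-aligned storage                */
    char *p = (char *)words;   /* byte-level view of the same memory */

    printf("p+0 as int:   %s\n", IS_ALIGNED(p + 0, int)   ? "aligned" : "unaligned");
    printf("p+1 as int:   %s\n", IS_ALIGNED(p + 1, int)   ? "aligned" : "unaligned");
    printf("p+2 as short: %s\n", IS_ALIGNED(p + 2, short) ? "aligned" : "unaligned");
    return 0;
}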

Access alignment

[Diagram: memory seen as bus-wide rows at addresses 0x00, 0x04, 0x08, 0x0c, 0x10; an aligned access fits in one row of the bus, an unaligned one straddles two rows]

Structure alignment

In a structure:

- Fields must be in declaration order
- Fields must all be aligned

Data alignment does not depend on the architecture bus width.

struct bit_packed_s {
    int   a;
    short b;
    short c;
};

[Layout: a fills the first 32-bit word; b and c share the second word]

Gabriel Laskar (EPITA) CAAL 2015 67 / 378 I Compilers put structure fields at aligned offsets I Alignment may add unused padding bytes between fields struct example_aligned_s { char a; 015 16 32 int b; a short c; }; b

c

padding bytes

Memory Alignment Structure alignment Padding Sometimes, consecutive fields in declaration cannot be consecutive and aligned in memory

Gabriel Laskar (EPITA) CAAL 2015 68 / 378 Memory Alignment Structure alignment Padding Sometimes, consecutive fields in declaration cannot be consecutive and aligned in memory

I Compilers put structure fields at aligned offsets I Alignment may add unused padding bytes between fields struct example_aligned_s { char a; 015 16 32 int b; a short c; }; b

c
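The padding can be made visible with offsetof and sizeof. A small sketch; exact values depend on the ABI, but on typical 32-bit and 64-bit ABIs b lands at offset 4 and the total size is 12:

#include <stdio.h>
#include <stddef.h>

struct example_aligned_s {
    char  a;   /* offset 0                                     */
    int   b;   /* offset 4: 3 padding bytes inserted before it */
    short c;   /* offset 8                                     */
};             /* total size 12: 2 trailing padding bytes      */

int main(void)
{
    printf("a@%zu b@%zu c@%zu sizeof=%zu\n",
           offsetof(struct example_aligned_s, a),
           offsetof(struct example_aligned_s, b),
           offsetof(struct example_aligned_s, c),
           sizeof(struct example_aligned_s));
    return 0;
}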

Packed structures: Basic packing

- Field alignment can be ignored by the compiler, on request
- Few architectures are able to access non-aligned fields directly
- If not supported, unaligned access is emulated with multiple memory accesses, shifts, ORs, etc.

struct example_packed_s {
    char  a;
    int   b;
    short c;
} __attribute__((packed));

[Layout: a, b and c are stored back to back, with no padding]
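A small comparison of the two layouts; assuming GCC/Clang's __attribute__((packed)), the packed variant drops all padding (7 bytes instead of the usual 12):

#include <stdio.h>

struct example_aligned_s {
    char  a;
    int   b;
    short c;
};

struct example_packed_s {
    char  a;
    int   b;
    short c;
} __attribute__((packed));   /* GCC/Clang extension: no padding inserted */

int main(void)
{
    printf("aligned: %zu bytes, packed: %zu bytes\n",
           sizeof(struct example_aligned_s),   /* usually 12 */
           sizeof(struct example_packed_s));   /* 7          */
    return 0;
}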

Low-level packing

- Packing can even be done at the bit level!
- The compiler will handle shifts and masks
- Can be mixed with unions
- Powerful for matching existing protocols

struct bit_packed_s {
    int a:17;
    int b:5;
    int c:12;
    int  :16;
    int e:14;
} __attribute__((packed));

[Layout: a, b and c packed into the first 32 bits; the remaining fields follow immediately]

Beware of endianness!

Endianness: Different designs, different paradigms

A value represented with multiple bytes must be stored in memory. Similarly to written language, these bytes may be stored "left-to-right" or "right-to-left":

- Big-Endian
- Little-Endian
- Other endian modes

Explanation: Mathematical reference

With a base b, a natural number N may be decomposed in digits d_k. If we write it naturally:

N = d_n d_(n-1) ... d_k ... d_1 d_0
N = d_n·b^n + d_(n-1)·b^(n-1) + ... + d_k·b^k + ... + d_1·b^1 + d_0·b^0

N = 48103 (base 10) = bbe7 (base 16)

- With b = 10: N = 4·10^4 + 8·10^3 + 1·10^2 + 0·10^1 + 3·10^0
- With b = 16: N = 11·16^3 + 11·16^2 + 14·16^1 + 7·16^0

So logically we tend to count digits from the LSB: right to left.

Digit no:       4  3  2  1  0
Base 10 value:  4  8  1  0  3
Base 16 value:     b  b  e  7

Memory representation

Usually, we like to represent memory in written order, the same way we write words on a paper sheet: left to right

       0    1    2    3
0x00  'A'  ' '  's'  'i'
0x04  'm'  'p'  'l'  'e'
0x08  ' '  'm'  'e'  's'
0x0c  's'  'a'  'g'  'e'

Integer memory representation

The "endianness" problem is whether to write integers:

- in text order: the natural way (for western languages)
- in index order: with digit 0 at address 0

Storing the 32-bit values 0x00010203, 0x10111213, 0x20212223, 0x30313233, 0x40414243:

        Big endian     Little endian
0x00    00 01 02 03    03 02 01 00
0x04    10 11 12 13    13 12 11 10
0x08    20 21 22 23    23 22 21 20
0x0c    30 31 32 33    33 32 31 30
0x10    40 41 42 43    43 42 41 40

Endianness: do not mix everything!

- Data must be stored and fetched using the same convention.
- Don't worry about byte order in registers; it does not make sense there.

Endianness demo

int main(void)
{
    unsigned int a = 0x12345678;
    hexdump(&a);
}
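The listing relies on a hexdump() helper that is not shown on the slide. A minimal sketch of what it could look like, assuming it simply prints the bytes of the integer in memory order (lowest address first); on a little-endian x86 machine this prints 78 56 34 12, on a big-endian machine 12 34 56 78:

#include <stdio.h>

/* Illustrative helper, not from the slides: dump the bytes of an
 * unsigned int in memory order, lowest address first. */
static void hexdump(const unsigned int *p)
{
    const unsigned char *bytes = (const unsigned char *)p;
    for (size_t i = 0; i < sizeof(*p); i++)
        printf("%02x ", bytes[i]);
    printf("\n");
}

int main(void)
{
    unsigned int a = 0x12345678;
    hexdump(&a);
    return 0;
}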

Caches

Cache memory: reason

- Once upon a time, CPU and memory speeds were the same.
- Of course, they evolved:
  - Moore's law: CPU power doubles every 2 years,
  - Memory speed: +7% every year.

Cache memory: definition

Cache memory is:

- a local copy of central memory,
- transparent, on demand,
- volatile (it may be flushed anytime),
- faster and closer to the CPU than central memory,
- expensive!

Cache

[Diagram: per-CPU L1 instruction and data caches, a per-CPU shared L2 (instructions + data), and a die-shared L3 (instructions + data + TLB)]

- There are multiple cache "levels":
  - with different sizes,
  - with different speeds,
  - with different latencies.
- Caches may be shared between code and data.

Cache memory side-effects

Caches may also involve strangeness:

- having copies of memory introduces a coherence problem,
- the time taken by an access to a given memory location may vary.

There are various levels of memory operations:

- C: the ones you write in C code,
- Assembler: the ones the compiler generated,
- CPU: the ones the CPU actually performs,
- In-cache: the ones the cache performs in the system.

Part IV

Memory mapping

4: Memory mapping

Address space

Computer address spaces

Computer address space translation

MMU patterns

Address space: Definition

An address space is a set of discrete values targeting a set of objects.

- Each address points to one object
- One given object may be pointed to by more than one address

[Diagram: addresses 1, 2, 3 pointing to objects foo, bar, baz]

Address space: Translation

Address space translation creates a new address space where each source address is mapped to a destination address. A given object can then be targeted through either address space.

[Diagram: addresses a, b, c mapped onto addresses 1, 2, 3, which point to foo, bar, baz]

Computer address spaces: Definition

A memory address space:

- is contiguous
- can be mapped to another target address space (through address space translation)

All memory accesses are done with respect to an address space.

Usual address spaces

CPU: address space available through a CPU register. Current machines have 32- or 64-bit pointer registers.

Physical: address space actually wired between hardware components. Current machines have physical address spaces of around 40 bits.

Virtual: address space reachable by a process. Generally 32 or 64 bits.

Physical memory

Physical memory provides the lowest accessible address space in a computer. Physical RAM is usually accessible as a small subset of the physical address space. Physical memory can be mapped in different ways depending on the physical address bus implementation:

- Accessible at a given location,
- Scattered at multiple locations,
- Accessible (many times) at multiple locations.

Computer address space translation: Segmentation

Memory segments

Memory segments define an address space as a sub-region of another address space. A segment is mostly defined by the following attributes:

- Segment base address in the target address space,
- Segment size,
- Segment type and access rights.

Simple segment

Segmentation keeps memory locations contiguous.

[Diagram: a contiguous range of the source address space mapped onto a contiguous range of the destination address space]

Physical address space

[Diagram: physical memory mapped into the physical address space]

Single mapped RAM

[Diagram: RAM of size n mapped once into the physical address space; addr[x] = mem[x] for x < n, and addr[x >= n] is undefined]

Multiple/loop mapped RAM

[Diagram: RAM of size n repeated across the physical address space; addr[x] = mem[x % n]]

Memory access using segments

To access memory through a defined segment, the CPU performs the following tasks (a small sketch follows):

- Check the requested address against the segment size,
- Add the segment base address to the requested memory address,
- Check access rights.
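A minimal C sketch of that translation, under the simplifying assumption of a segment described only by a base, a size and a writable flag (real descriptors carry more attributes):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct segment {
    uint32_t base;      /* base address in the target address space */
    uint32_t size;      /* segment size in bytes                    */
    bool     writable;  /* simplified access rights                 */
};

/* Translate an offset inside the segment into a target address.
 * Returns false on a segmentation violation. */
static bool segment_translate(const struct segment *seg, uint32_t offset,
                              bool write, uint32_t *target)
{
    if (offset >= seg->size)       /* 1. check against the segment size */
        return false;
    if (write && !seg->writable)   /* 3. check access rights            */
        return false;
    *target = seg->base + offset;  /* 2. add the segment base address   */
    return true;
}

int main(void)
{
    struct segment data = { .base = 0x2000, .size = 0x1000, .writable = true };
    uint32_t addr;

    if (segment_translate(&data, 0x124, false, &addr))
        printf("offset 0x124 -> target 0x%x\n", addr);          /* 0x2124 */
    if (!segment_translate(&data, 0x2000, false, &addr))
        printf("offset 0x2000 -> segmentation violation\n");
    return 0;
}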

[Diagram: an access offset in the process address space is added to the segment base address to form the effective address in the target address space]

Segment descriptor table

[Diagram: a CPU register selects an entry in a table of segment descriptors (base address, size); each descriptor maps one segment into the target address space]

Limitations of segmentation

Segmentation is a great thing, but it has a few limitations:

- Address space must be mapped in contiguous blocks
- Thus segments are difficult to grow on demand
- A whole segment must be present in the target address space

Pagination: Memory pages

Modern operating systems need more than segmentation. The basic idea is to split the source address space into pages, and map each source page to a target page.

[Diagram: pages of the source address space individually mapped onto pages of the target address space]

Pages mapping

Memory pages need to be mapped using a specific descriptor table recognized by the CPU. Usual attributes for a page are:

- address in the target address space,
- type and access rights,
- page size,
- other attributes (cacheability, coherence, ...).

Pages descriptor table

[Diagram: a CPU register points to the page descriptor table; each entry maps one source page to a target page, or is unmapped]

Pages mapping: Rationale

Splitting memory into pages allows more powerful mechanisms:

- Address spaces may be mapped to non-contiguous target pages, allowing memory fragmentation.
- Many interesting operations can be performed on pages (sharing, swapping, ...).

MMU patterns

Memory protection: Motivations

Why do we need memory protection?

Memory protection

In order to execute code securely, the hardware has to provide memory protection systems. This protects:

- the system from the hosted processes
- hosted processes from each other

It may even protect the system from its own components, in a micro-kernel approach.

Memory protection: How?

Several memory access checks are performed by the CPU, for each access, transparently:

- address bounds validity,
- privilege level,
- operation type.

Memory protection: Where?

- In a segmentation-based system, privileges are per segment.
- In a pagination-based system, privileges are in the page descriptor table, with page granularity.

x86 mixes both, with segmentation on top of pagination.

Privileges: Privilege levels

The privilege level keeps memory from being accessed by non-authorized code. Different CPUs define different privilege levels.

[Diagram: an instruction running from a level-1 code segment/page is allowed to access a level-1 data segment/page, but access to a level-0 data segment/page is illegal]

Operation type

Operation type checking keeps code from doing unwanted memory operations.

[Diagram: an executable code segment/page allows execution but not writes; a read-only data segment/page allows reads but not writes; a read/write data segment/page allows both]

Process switching

Pagination systems are easier to use with a separate memory address space for each process running on a computer:

- Switching the process space implies changing the page descriptor table.
- Each process has its own page descriptor table ready to be used by the CPU.
- Only the CPU register pointing to this table has to be changed to set up a new address space.

[Diagram: process A and process B each have their own page table mapping their address space onto the target address space; the CPU register points to the page table of the currently running process]

Memory sharing

Memory pagination can be used to share pages between processes. A single page can be mapped in several process address spaces to permit different behaviors:

- Save physical memory by not duplicating shared code and read-only data (used for shared libraries),
- Use shared memory for inter-process communication.

[Diagram: some pages of process A and process B map, through their respective page tables, onto the same pages of the target address space]

Copy On Write

Copy On Write (COW) is a powerful trick used in many situations.

One of the most usual ones is fork(), where the whole address space of a process has to be cloned.

The basic idea is to make the whole memory read-only and actually copy a page only when necessary, as late as possible (see the sketch below).
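A minimal user-space sketch of the effect, assuming a POSIX system: after fork(), parent and child initially share the same physical pages; the first write from either side triggers the actual copy, so each process ends up seeing its own value.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    int value = 1;       /* lives in a page shared copy-on-write after fork() */
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }

    if (pid == 0) {      /* child: this write triggers the page copy */
        value = 2;
        printf("child:  value = %d\n", value);    /* 2 */
        return EXIT_SUCCESS;
    }

    wait(NULL);          /* parent: its page was never modified */
    printf("parent: value = %d\n", value);        /* still 1 */
    return EXIT_SUCCESS;
}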

Copy On Write: Step by step

[Diagram sequence: after cloning, the two page tables reference the same target pages, all marked read-only; when one process writes to a shared page, the page is duplicated and that process's page table entry is updated to point to the new copy]

Page swapping

Page swapping is a mechanism that artificially enlarges physical memory with a part of the hard disk.

Page swapping: Step by step

[Diagram sequence: a page is written out to disk and its descriptor entry is marked as swapped; when the process accesses it again, the page is loaded back from disk into a (possibly different) physical page and the descriptor is updated]

mmap()

Make part of the memory exactly match the contents of a file.

I Reflect changes immediately between processes having the file opened

I Permit different protections on different parts of the file

I Lazily load parts of file

I Lazily write parts back (see the usage sketch below)
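A minimal usage sketch, assuming a POSIX system; the file name foo.bin is only illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("foo.bin", O_RDONLY);
    struct stat st;

    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    /* map the whole file: pages are loaded lazily, on first access */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    fwrite(p, 1, st.st_size, stdout);   /* the file contents are used as plain memory */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}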

Gabriel Laskar (EPITA) CAAL 2015 138 / 378 Memory mapping MMU patterns mmap() mmap() Step by step

[Diagram: mmap(), step by step — initially the page-table entries of the target address space are empty; after mmap() they refer to pages of the file foo.bin (f:0, f:1, f:2, f:3) instead of physical frames; when a page such as foo.bin:3 is first accessed, it is loaded from disk into a physical frame (here frame 8) and the page-table entry is updated.]

Gabriel Laskar (EPITA) CAAL 2015 139 / 378 Execution flow

Part V

Execution flow

Gabriel Laskar (EPITA) CAAL 2015 140 / 378 Execution flow Branch, function calls 5: Execution flow

Branch, function calls Branch principle Pipeline considerations

Function calls

Handling events

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 141 / 378 Execution flow Branch, function calls Branch principle 5: Execution flow

Branch, function calls Branch principle Pipeline considerations

Function calls

Handling events

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 142 / 378 Execution flow Branch, function calls Branch principle Branch principle

Branching is breaking the normal incremental execution flow to go execute code somewhere else. A branch has

I a destination,

I optionally a condition,

I optionally a “link” feature, saving return point.

Gabriel Laskar (EPITA) CAAL 2015 143 / 378 Execution flow Branch, function calls Branch principle Branch offset

Instructions are fetched from memory to the CPU by dereferencing the program counter (%pc register)

I in normal execution flow, the %pc is auto-incremented

Addresses   Instructions
80450010    add %g1, %g2, %g3
80450014    mov %g0, %g1
80450018    xor %g2, %g3, %g2   <- executed instruction
8045001C    sub %g3, %g2, %g1   <- next instruction (%pc)
80450020    ...

I when a branch occurs, the %pc is affected in two possible ways:
  relative branch: a constant offset is added to the %pc
  absolute branch: an absolute address is loaded into the %pc

Gabriel Laskar (EPITA) CAAL 2015 144 / 378 Execution flow Branch, function calls Branch principle Unconditional branch

The %pc register is always modified.

I Explicit jump:

I Explicit infinite loop, optimized by the compiler

Addresses   Instructions
80450010    b 8045011C          <- executed instruction
80450014    mov %g0, %g1
...
80450118    xor %g2, %g3, %g2
8045011C    sub %g3, %g2, %g1   <- next instruction (%pc)
80450120    ...

Gabriel Laskar (EPITA) CAAL 2015 145 / 378 Execution flow Branch, function calls Branch principle Unconditional branch C listing

#include <stdio.h>

void main(void)
{
    do {
        puts("hello");
    } while (1);
}

Gabriel Laskar (EPITA) CAAL 2015 146 / 378 Execution flow Branch, function calls Branch principle Unconditional branch Control flow graph main

puts("hello");

return

Gabriel Laskar (EPITA) CAAL 2015 147 / 378 Execution flow Branch, function calls Branch principle Conditional branch

The %pc register is modified only if the condition is verified

I if statements

I loop (for, while) statements

Addresses   Instructions
80450010    bz 8045011C         <- executed instruction
80450014    mov %g0, %g1        <- potential next instruction
80450018    ...
...
80450118    xor %g2, %g3, %g2
8045011C    sub %g3, %g2, %g1   <- potential next instruction (%pc)
80450120    ...

Gabriel Laskar (EPITA) CAAL 2015 148 / 378 Execution flow Branch, function calls Branch principle Conditional branch C listing

#include <stdio.h>

void main(void)
{
    int i = 10;

    do {
        printf("%i\n", --i);
    } while (i != 0);
}

Gabriel Laskar (EPITA) CAAL 2015 149 / 378 Execution flow Branch, function calls Branch principle Conditional branch Control flow graph

main

i = 10

−−i

printf("%i\n");

yes i != 0

no

return Gabriel Laskar (EPITA) CAAL 2015 150 / 378 Execution flow Branch, function calls Branch principle Conditions

The decision to take the branch is based on register contents:

I conditional branch may occur if a specific bit is set in a register

I conditional branch may occur if a specific bit is clear in a register

I conditional branch may occur if a register equals a specific value (usually zero)

Gabriel Laskar (EPITA) CAAL 2015 151 / 378 Execution flow Branch, function calls Branch principle Complete examples

1. strlen 2. pgcd (greatest common divisor) — C reference versions below
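For reference, straightforward C versions of the two examples (my own sketch, not the course's assembly solutions):

#include <stddef.h>

size_t my_strlen(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        n++;
    return n;
}

unsigned pgcd(unsigned a, unsigned b)
{
    while (b != 0) {            /* Euclid's algorithm */
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}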

Gabriel Laskar (EPITA) CAAL 2015 152 / 378 Execution flow Branch, function calls Pipeline considerations 5: Execution flow

Branch, function calls Branch principle Pipeline considerations

Function calls

Handling events

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 153 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations

A branch may break the pipeline, slowing the execution:
1. when things go wrong, a bubble is created
2. sometimes, the processor succeeds in predicting the branch target
3. when a misprediction occurs, a bubble is created

Gabriel Laskar (EPITA) CAAL 2015 154 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Bubble

When a branch occurs, the processor may flush the stages of the pipeline that contain useless instructions:

Pipeline timeline (one line per instruction in the sequence):
sub   Fetch Decod Exec Mem Write
not   Fetch Decod Exec Mem Write
add   Fetch Decod Exec Mem Write
jmp   Fetch Decod Exec Mem
−−    Fetch −− −−              <- the instruction fetched after the branch is flushed (bubble)
add   Fetch Decod              <- branch target, fetched one cycle later
?     Fetch

Gabriel Laskar (EPITA) CAAL 2015 155 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Delay slot

A delay slot is a processor feature: by convention, the instruction following a branch is always executed; the branch is delayed by one instruction.

Pipeline timeline:
sub   Fetch Decod Exec Mem Write
not   Fetch Decod Exec Mem Write
add   Fetch Decod Exec Mem Write
jmp   Fetch Decod Exec Mem
sub   Fetch Decod Exec         <- the instruction following jmp (delay slot) is executed
add   Fetch Decod              <- execution flow is modified one cycle later
?     Fetch

Gabriel Laskar (EPITA) CAAL 2015 156 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression)

Pipeline timeline:
sub g1, g2, g3      Fetch Decod Exec Mem Write
ld  g5, [g2]        Fetch Decod Exec Mem Write
add g6, 0x2a, g5    Fetch Decod (Exec) −− −−    <- needs g5 before the load has gone through Mem
add g6, 0x2a, g5    Fetch (Decod) Exec Mem      <- replayed one cycle later (load-use bubble)
ld  i3, [g6 + 0x8]  (Fetch) Decod Exec
xor i4, 0x2a, i3    Fetch Decod
?                   Fetch

Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Function calls 5: Execution flow

Branch, function calls

Function calls Principles Argument passing Call conventions

Handling events

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 158 / 378 Execution flow Function calls Principles 5: Execution flow

Branch, function calls

Function calls Principles Argument passing Call conventions

Handling events

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 159 / 378 Execution flow Function calls Principles Function calls principle A function is a piece of code that returns a value depending on the parameters the user specifies as inputs, if any.

[Diagram: a function call — the main code (addresses 8004500 ... 800450B) branches to the function code (addresses 8002000 ... 8002008), which then returns to the instruction following the call.]

Gabriel Laskar (EPITA) CAAL 2015 160 / 378 Execution flow Function calls Principles “call” instruction

I Saves next instruction address (the return path),

I Jumps to function

Addresses   Instructions
80450010    call 80002000        <- executed instruction
80450014    mov %g0, %g1         <- delay slot instruction
80450018    xor %g2, %g1, %g3    <- return address (%i7 = 80450018)
...
80002000    add %g1, %g2, %g3    <- next instruction (%pc)
80002004    ...

Gabriel Laskar (EPITA) CAAL 2015 161 / 378 Execution flow Function calls Principles “ret” instruction

I Stores the result in a dedicated register

I Restores the previously-saved PC address

Addresses   Instructions
800020F8    ret                  <- executed instruction
800020FC    not %g1, %g2         <- delay slot instruction
80002100    ...
...
80450010    call 80002000
80450014    mov %g0, %g1
80450018    xor %g2, %g1, %g3    <- next instruction (%pc = return address from %i7)

Gabriel Laskar (EPITA) CAAL 2015 162 / 378 Execution flow Function calls Argument passing 5: Execution flow

Branch, function calls

Function calls Principles Argument passing Call conventions

Handling events

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 163 / 378 Execution flow Function calls Argument passing How to pass arguments and return values?

We need a space of shared data between the caller and the callee.

I From caller to callee, for arguments

I From callee to caller, for return values Some arguments are purely for execution purposes:

I context pointers (stack, globals, ...)

I return address

Gabriel Laskar (EPITA) CAAL 2015 164 / 378 Execution flow Function calls Argument passing Simplest case

I Non-nested function call

I Fewer arguments than available machine registers

Gabriel Laskar (EPITA) CAAL 2015 165 / 378 Execution flow Function calls Argument passing Through registers

On most RISC architectures, there are registers dedicated to argument storage:

I Each argument is stored and preserved directly in a register

I Local variables may be held in another set of registers

Example: on the SPARC architecture, the CPU has:

I 8 registers (%g0, ... %g7) dedicated to global variables

I 8 registers (%i0, ... %i7) dedicated to input arguments

I 8 registers (%l0, ... %l7) dedicated to local variables

I 8 registers (%o0, ... %o7) dedicated to output arguments

A callee's "input" registers are shared with the caller's "output" registers.

Gabriel Laskar (EPITA) CAAL 2015 166 / 378 Execution flow Function calls Argument passing Through global memory

Principle:

I Each argument is stored and preserved directly in memory

I Each local variable is held in memory

Gabriel Laskar (EPITA) CAAL 2015 167 / 378 Execution flow Function calls Call conventions 5: Execution flow

Branch, function calls

Function calls Principles Argument passing Call conventions

Handling events

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 168 / 378 Execution flow Function calls Call conventions Call conventions

I How to deal with a huge amount of variables and arguments?

I How to deal with recursive calls and nested function calls?

I How to deal with variables bigger than registers?

I How to deal with “...” (like printf)?

I How to deal with dedicated registers (floats)?

Gabriel Laskar (EPITA) CAAL 2015 169 / 378 Execution flow Function calls Call conventions Problem

Consider a recursive function that needs local variables:

I Each time the function is called, a new context must be allocated to preserve each local variable

I Each time the function returns, the previous context must be restored

I The function may call itself an unpredictable number of times

I These local variables cannot be reserved in a static memory space (as global variables are).

Gabriel Laskar (EPITA) CAAL 2015 170 / 378 Execution flow Function calls Call conventions Abstract context stack

Function calls

void foo(int a)
{
    foo(a + 1);
}

Local context

Local context

Local context

Function returns

Gabriel Laskar (EPITA) CAAL 2015 171 / 378 Execution flow Function calls Call conventions Register window

Hardware context stack implementation:

I Uses a large amount of registers (example: 512 on Sparc)

I Uses a logical limitation to define multiple contexts

I Does context change by sliding a window on each function call

Gabriel Laskar (EPITA) CAAL 2015 172 / 378 Execution flow Function calls Call conventions Principle

Abstract stack layout Hardware registers Sliding register window Function calls

[Diagram: sliding register window — the hardware registers %R95 ... %R120 are divided into overlapping windows; the current window exposes %i0−%i7, %l0−%l7 and %o0−%o7; a function call slides the window to the next local context, a function return slides it back to the previous one.]

Gabriel Laskar (EPITA) CAAL 2015 173 / 378 Execution flow Function calls Call conventions Sparc overlapping register window

[Diagram: SPARC overlapping register windows — on a function call the register window slides so that the calling function's output registers (%o0 ... %o7) overlap with the called function's input registers (%i0 ... %i7); each context also has its own local registers (%l0 ... %l7).]

Gabriel Laskar (EPITA) CAAL 2015 174 / 378 Execution flow Function calls Call conventions Prologue and epilogue

prologue: slide the register window. Example:
    save %sp, -96, %sp
epilogue: slide the register window back (i.e. restore the previous context). Example:
    restore

Gabriel Laskar (EPITA) CAAL 2015 175 / 378 Execution flow Function calls Call conventions Memory stack

Register window has hard limitations:

I The call depth is limited to the total amount of registers

I The CPU needs a large amount of physical registers ⇒ expensive

On most systems, memory is used to implement a cheaper stack.

Gabriel Laskar (EPITA) CAAL 2015 176 / 378 Execution flow Function calls Call conventions Principle

Abstract stack layout Sliding stack frame Process memory Function calls

[Diagram: sliding stack frame — the process stack memory holds one frame per active call; the current frame contains the called function's arguments, the return address (located through the frame pointer) and the local variables (located through the stack pointer); a function call pushes a new frame, a function return pops it.]

Gabriel Laskar (EPITA) CAAL 2015 177 / 378 Execution flow Function calls Call conventions Nested function calls Stack memory space

[Diagram: nested function calls in the stack memory space — each call pushes a frame containing the called function's input arguments, the return address and its local variables; the frame pointer marks the current frame and the stack pointer its top, so the frames of nested calls pile up one below the other.]

Gabriel Laskar (EPITA) CAAL 2015 178 / 378 Execution flow Function calls Call conventions Function prologue and epilogue

prologue: save the previous frame pointer, set the new frame pointer, reserve space on the memory stack for local variables.
Example: a function that needs three 32-bit local variables (12 bytes) on its stack:
    [%sp] <- %fp
    %fp   <- %sp
    %sp   <- %sp - 12
epilogue: restore the previous frame and stack pointers (i.e. restore the previous context).
Example:
    %sp <- %fp
    %fp <- [%fp]

Gabriel Laskar (EPITA) CAAL 2015 179 / 378 Execution flow Function calls Call conventions Argument and local variable access

argument: dereference the address "frame pointer + argument offset":
    ld [%fp + 16], %g1   ; Access to arg_a
local variable: dereference the address "frame pointer - local variable offset":
    ld [%fp - 4], %g1    ; Access to var_b

Gabriel Laskar (EPITA) CAAL 2015 180 / 378 Execution flow Function calls Call conventions Argument and local variable access schema

void foo(int arg_a, int arg_b)
{
    int var_a, var_b, var_c;
    ...
}

[Stack layout for foo, from the top of the stack downwards: stack pointer -> local variables int var_c, int var_b, int var_a; frame pointer -> previous fp, ret address; function arguments int arg_b, int arg_a; below them, the caller's previous fp and ret address.]

Gabriel Laskar (EPITA) CAAL 2015 181 / 378 Execution flow Handling events 5: Execution flow

Branch, function calls

Function calls

Handling events Events System calls Faults Hardware interruptions

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 182 / 378 Execution flow Handling events Events 5: Execution flow

Branch, function calls

Function calls

Handling events Events System calls Faults Hardware interruptions

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 183 / 378 Execution flow Handling events Events Events

What are events?

How to handle them?

Gabriel Laskar (EPITA) CAAL 2015 184 / 378 Execution flow Handling events Events Categorizing events

An interruption is caused by an external event. An exception is caused by instruction execution.

               Unplanned                Deliberate
Synchronous    fault                    syscall trap, software interrupt
Asynchronous   hardware interruption

→ An incoming event must be handled with a high priority.

Gabriel Laskar (EPITA) CAAL 2015 185 / 378 Execution flow Handling events Events User space / kernel space A trap is a critical event that must be handled by the kernel in a safe memory space, the kernel space.

[Diagram: process memory space in user mode vs kernel mode — the kernel code, kernel data and kernel stack are part of the address space but are invalid (inaccessible) memory in user mode; the user code, user data and user stack are accessible in both modes.]

Gabriel Laskar (EPITA) CAAL 2015 186 / 378 Execution flow Handling events Events Trap handler table The processor jumps to a trap routine defined by the operating system. The trap routine address is stored in a dedicated descriptor table.

[Diagram: trap handler table — in kernel mode, the interruption table address register points to the interruption table held in kernel data; a trap #4 raised by user code indexes this table to find the address of the kernel routine to execute.]

Gabriel Laskar (EPITA) CAAL 2015 187 / 378 Execution flow Handling events System calls 5: Execution flow

Branch, function calls

Function calls

Handling events Events System calls Faults Hardware interruptions

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 188 / 378 Execution flow Handling events System calls System calls

The system call mechanism can be used by a process to request services from the operating system:

I executing a process, exiting

I reading input, writing output (read, write...)

I performing restricted actions such as accessing hardware devices or accessing the memory management unit.

I etc, ...

Gabriel Laskar (EPITA) CAAL 2015 189 / 378 Execution flow Handling events System calls Processor modes

Hardware facilities have been implemented to ensure process isolation in a multi-user/multi-processor environment. Today, processors have two (or more) execution modes:
user mode         dedicated to user applications
supervisor mode   reserved for the operating system kernel

Gabriel Laskar (EPITA) CAAL 2015 190 / 378 Execution flow Handling events System calls Execution permissions

For security's sake, some operations are restricted to kernel space:

I peripheral input and output

I low-level memory management

System calls allow the user to switch to kernel space and perform restricted actions, under certain conditions. The kernel must check the validity of system call parameters.

Gabriel Laskar (EPITA) CAAL 2015 191 / 378 Execution flow Handling events System calls System call implementation

1. System calls use a dedicated instruction which causes the processor to change mode to supervisor (or protected) mode.
2. Each system call is indexed by a single number: the syscall trap handler makes an indirect call through the system call dispatch table to the handler for the specific system call.
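On Linux, this path can also be exercised from C through the generic syscall(2) wrapper, which loads the system call number and arguments before executing the trap instruction (a sketch; the numbers come from <sys/syscall.h>):

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello\n";

    /* invoke the write system call directly, bypassing the stdio layer */
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}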

Gabriel Laskar (EPITA) CAAL 2015 192 / 378 Execution flow Handling events System calls Execution path

System calls run in kernel mode on the kernel memory space.

[Diagram: execution path of a system call — user code → system call → context switch → kernel code → system return → context switch → back to user code.]

Gabriel Laskar (EPITA) CAAL 2015 193 / 378 Execution flow Handling events System calls libc example

The standard C library's read, write, pipe, ... functions are wrappers to the corresponding system calls:

_SYSENTRY(pipe)
        mov      %o0, %o2
        mov      SYS_pipe, %g1
        ta       %xcc, ST_SYSCALL
        bcc,a,pt %xcc, 1f
         stw     %o0, [%o2]
        ERROR()
1:      stw      %o1, [%o2 + 4]
        retl
         clr     %o0
_SYSEND(pipe)

Gabriel Laskar (EPITA) CAAL 2015 194 / 378 Execution flow Handling events Faults 5: Execution flow

Branch, function calls

Function calls

Handling events Events System calls Faults Hardware interruptions

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 195 / 378 Execution flow Handling events Faults Faults

How to handle divide by zero? How to handle overflow? How to handle ...

Gabriel Laskar (EPITA) CAAL 2015 196 / 378 Execution flow Handling events Faults Similarities with system calls

Faults are similar to system calls in some respects:

I Faults occur as a result of a process executing an instruction

I The kernel exception handler may return to the faulty user context

But faults are different in other respects:

I Syscalls are deliberate, faults are unexpected

I Not every execution of the instruction results in a fault

Gabriel Laskar (EPITA) CAAL 2015 197 / 378 Execution flow Handling events Faults Handling a fault

Different actions may be taken by the operating system in response to faults:

I kill the user process

I notify the process that a fault occurred (so it may recover in its own way)

I solve the problem and resume the process transparently

Gabriel Laskar (EPITA) CAAL 2015 198 / 378 Execution flow Handling events Faults Execution path

[Diagram: execution path of a fault — user code → faulting instruction → context save → kernel code → context restore → user process code.]

Gabriel Laskar (EPITA) CAAL 2015 199 / 378 Execution flow Handling events Faults Events interface

Unix systems can notify a user program of a fault with a signal. Signals are also used for other forms of asynchronous event notifications.
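For example, a process can ask to be notified of an invalid memory access instead of simply being killed (a minimal sketch using sigaction(2); a real handler can rarely just resume the faulty code):

#include <signal.h>
#include <string.h>
#include <unistd.h>

static void on_segv(int sig)
{
    (void)sig;
    /* recover "in its own way": report the fault and terminate cleanly */
    write(2, "caught SIGSEGV\n", 15);
    _exit(1);
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_segv;
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *)0 = 42;    /* faulty access: the kernel delivers SIGSEGV */
    return 0;
}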

Gabriel Laskar (EPITA) CAAL 2015 200 / 378 Execution flow Handling events Hardware interruptions 5: Execution flow

Branch, function calls

Function calls

Handling events Events System calls Faults Hardware interruptions

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 201 / 378 Execution flow Handling events Hardware interruptions Hardware interruptions

How to handle device interruptions?

Gabriel Laskar (EPITA) CAAL 2015 202 / 378 Execution flow Handling events Hardware interruptions Execution path

[Diagram: execution path of a hardware interruption — user code → hardware interruption → context save → kernel code → interruption return → context load → user code.]

Gabriel Laskar (EPITA) CAAL 2015 203 / 378 Execution flow Multitasking 5: Execution flow

Branch, function calls

Function calls

Handling events

Multitasking

Gabriel Laskar (EPITA) CAAL 2015 204 / 378 Execution flow Multitasking Multitasking

A single process may not use system resources at full capacity. The idea of multitasking is to simulate the concurrent execution of many processes using a single processor. The operating system has to:

I create and delete processes,

I organize processes in memory,

I schedule processes for CPU use.

Gabriel Laskar (EPITA) CAAL 2015 205 / 378 Execution flow Multitasking Memory mapping

[Diagram: per-process memory mapping — process A and process B each have their own addressing space and their own memory map (page table); the page table address register selects the active map; both maps translate into the same physical memory space.]

Gabriel Laskar (EPITA) CAAL 2015 206 / 378 Execution flow Multitasking Context switching

[Diagram: context switching — the real CPU registers (Reg 1, Reg 2, ..., FlagReg, PageReg) are backed up into the memory context of process A and reloaded from the memory context of process B.]

Gabriel Laskar (EPITA) CAAL 2015 207 / 378 Execution flow Multitasking Process life

[Diagram: process life — on each timer interruption the kernel saves the context of the running process and loads the context of the next one, alternating between Process A, Process B and Process C.]

Gabriel Laskar (EPITA) CAAL 2015 208 / 378 This approach requires a way of handling events. Object file formats

Part VI

Object file formats

Gabriel Laskar (EPITA) CAAL 2015 209 / 378 Object file formats Build process 6: Object file formats

Build process Overview Development tools Analysis tools

Binary formats

Executable code loading

Gabriel Laskar (EPITA) CAAL 2015 210 / 378 Object file formats Build process Overview 6: Object file formats

Build process Overview Development tools Analysis tools

Binary formats

Executable code loading

Gabriel Laskar (EPITA) CAAL 2015 211 / 378 Object file formats Build process Overview The big picture

[Diagram: build pipeline — C source code (.c) → cpp → pure C code (.i) → cc1 → assembly code (.s) → as → object file (.o); header files (.h) feed cpp; ld merges the object files (.o) into the executable a.out.]

Gabriel Laskar (EPITA) CAAL 2015 212 / 378 Object file formats Build process Overview Preprocessor

I Also called pre-processor
I Transforms a text (source) file into another text (source) file:
  I merges many files into one (#include)
  I replaces macros by their definitions (#define)
  I removes code considering conditions (#if*)

Example: cpp
I Input: C source file with directives
I Output: "pure" C source file

Gabriel Laskar (EPITA) CAAL 2015 213 / 378 Object file formats Build process Overview C compiler


I Scans and parses the source file

I Analyzes the source (type checking)

I Generates assembly code (another source file) for the target architecture

Gabriel Laskar (EPITA) CAAL 2015 214 / 378 Object file formats Build process Overview Assembler


I Translates instructions into binary opcode sequences for the target architecture

I Collects symbols and resolves addresses

I Writes an object file

The assembler source file will be different depending on the architecture!

Gabriel Laskar (EPITA) CAAL 2015 215 / 378 Object file formats Build process Overview Linker


I Merges (many) object files

I Resolves external symbols

I Computes addresses

I Writes an executable file

Gabriel Laskar (EPITA) CAAL 2015 216 / 378 Object file formats Build process Development tools 6: Object file formats

Build process Overview Development tools Analysis tools

Binary formats

Executable code loading

Gabriel Laskar (EPITA) CAAL 2015 217 / 378 Object file formats Build process Development tools Development tools

I SPARC assembler: use aasm

I Project at http://savannah.nongnu.org/projects/aasm
I Sparc, Mips, PPC, ARM, ... architectures

I SoCLib: https://www.soclib.fr
I QEmu

Gabriel Laskar (EPITA) CAAL 2015 218 / 378 Object file formats Build process Analysis tools 6: Object file formats

Build process Overview Development tools Analysis tools

Binary formats

Executable code loading

Gabriel Laskar (EPITA) CAAL 2015 219 / 378 Object file formats Build process Analysis tools objdump

I Displays object (executable) file tables, on the host architecture
I Disassembles object (executable) file code-sections
I Useful options:
  I -h: display the section headers
  I -t: display the symbol tables
  I -dx: disassemble

Gabriel Laskar (EPITA) CAAL 2015 220 / 378 Object file formats Build process Analysis tools readelf

I Displays ELF object (executable) file tables, on any architecture

Gabriel Laskar (EPITA) CAAL 2015 221 / 378 Object file formats Binary formats 6: Object file formats

Build process

Binary formats Simple binary formats How about code reuse and splitting Linking File formats history

Executable code loading

Gabriel Laskar (EPITA) CAAL 2015 222 / 378 Object file formats Binary formats Simple binary formats 6: Object file formats

Build process

Binary formats Simple binary formats How about code reuse and splitting Linking File formats history

Executable code loading

Gabriel Laskar (EPITA) CAAL 2015 223 / 378 Object file formats Binary formats Simple binary formats Simple binary formats

Consider an object program that does not need other object programs to build a binary program:

I the assembler collects code markers (labels)

I the assembler resolves all labels and replaces all of them in the code

Gabriel Laskar (EPITA) CAAL 2015 224 / 378 Object file formats Binary formats Simple binary formats Flat program

The labels are immediately replaced and written by the assembler in the binary file:

Address    Opcodes
00000000   90                nop
00000001   B8 78 56 34 12    mov eax, 0x12345678
00000006   E8 00 12 00 00    call my_function
           ...
                             my_function:
00001200   90                nop
00001201   ...

Gabriel Laskar (EPITA) CAAL 2015 225 / 378 Object file formats Binary formats Simple binary formats Flat binary format

In raw binary format, the whole program is written directly in the file. Useful at boot time: the processor can only read raw code and there is no binary loader.

Gabriel Laskar (EPITA) CAAL 2015 226 / 378 Object file formats Binary formats How about code reuse and splitting 6: Object file formats

Build process

Binary formats Simple binary formats How about code reuse and splitting Linking File formats history

Executable code loading

Gabriel Laskar (EPITA) CAAL 2015 227 / 378 Object file formats Binary formats How about code reuse and splitting Challenges

What happens when a function must be imported?

What happens when a function must be exported?

Gabriel Laskar (EPITA) CAAL 2015 228 / 378 Object file formats Binary formats How about code reuse and splitting Complex program

When the symbol of a called function is unresolved, the assembler leaves the call destination address undefined:

Address    Opcodes           Source code
00000000   90                nop
00000001   B8 78 56 34 12    mov eax, 0x12345678
00000006   E8 00 12 00 00    call my_function
0000000B   E8 ?? ?? ?? ??    call printf        <- external reference
           ...
                             my_function:
00001200   90                nop
00001201   ...

⇒ The binary format must hold information on created “holes”

Gabriel Laskar (EPITA) CAAL 2015 229 / 378 Object file formats Binary formats How about code reuse and splitting Needs

To fill the holes left by the assembler later on, the binary file must hold:

I A symbol table, which stores each label identifier,

I A relocation table, which shows all the remaining "holes"

I Labels associated to each "hole".

More generally, a binary file format must also hold:

I A header containing general information needed to access the various parts of the file,

I Several sections holding code and data (raw data). A hypothetical C sketch of such a layout is given below.
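To make this concrete, a purely hypothetical, simplified object format could be described in C as follows; real formats such as ELF follow the same idea with more fields:

#include <stdint.h>

/* hypothetical object file header, for illustration only */
struct obj_header {
    uint32_t magic;             /* identifies the format                 */
    uint32_t text_offset;       /* where the .text raw data starts       */
    uint32_t text_size;
    uint32_t data_offset;       /* where the .data raw data starts       */
    uint32_t data_size;
    uint32_t bss_size;          /* no bytes in the file, only a size     */
    uint32_t symtab_offset;     /* symbol table                          */
    uint32_t symtab_count;
    uint32_t reloc_offset;      /* relocation table                      */
    uint32_t reloc_count;
    uint32_t strtab_offset;     /* string table holding the identifiers  */
};

struct obj_symbol {
    uint32_t name;              /* offset of the identifier in the string table */
    uint32_t value;             /* resolved address, if any                      */
};

struct obj_reloc {
    uint32_t offset;            /* position of the "hole" in the section  */
    uint32_t symbol;            /* index of the label in the symbol table */
};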

Gabriel Laskar (EPITA) CAAL 2015 230 / 378 Object file formats Binary formats How about code reuse and splitting Abstraction of a binary file format

[Diagram: abstraction of a binary file format — headers (file header, section table, relocation table, symbol table, string table) followed by the object file sections: .text, .rodata, .data, .bss (not in file).]

(Information regarding sections, symbol table etc. are usually identified within the file header)

Gabriel Laskar (EPITA) CAAL 2015 231 / 378 Object file formats Binary formats How about code reuse and splitting The symbol table

The symbol table associates each symbol identifier to the code (or data) address it represents (when the address has been resolved):

Code:
0000   E8 00 01 00 00   call my_func
0005   E8 ?? ?? ?? ??    call printf
       ...
       my_func:
0100   E8 ?? ?? ?? ??    call fopen
0105   90                nop
0106   E8 ?? ?? ?? ??    call printf

Symbol table: "printf", "my_func", "fopen"

Gabriel Laskar (EPITA) CAAL 2015 232 / 378 Object file formats Binary formats How about code reuse and splitting The relocation table

The relocation table associates each "hole" address to a label identifier:

Code:
0000   E8 00 01 00 00   call my_func
0005   E8 ?? ?? ?? ??    call printf
       ...
       my_func:
0100   E8 ?? ?? ?? ??    call fopen
0105   90                nop
0106   E8 ?? ?? ?? ??    call printf

Symbol table:     "printf", "my_func" (resolved to 0100), "fopen"
Relocation table: 0006 -> printf, 0101 -> fopen, 0107 -> printf

Gabriel Laskar (EPITA) CAAL 2015 233 / 378 Object file formats Binary formats How about code reuse and splitting BSS regions

Instead of keeping a bunch of empty bytes in the binary file for big static variables (arrays), the header can specify a whole region which must be filled with zeroes. BSS regions make the file smaller and faster to load.
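A quick way to see the difference in C (a sketch; the exact section placement is up to the toolchain): the zero-initialized array typically lands in .bss and costs nothing in the file, while the explicitly initialized one is stored byte for byte in .data:

/* usually placed in .bss: only its size is recorded in the binary file */
static char zeroed[1 << 20];

/* usually placed in .data: one megabyte of initialized bytes in the file */
static char filled[1 << 20] = { 1 };

int main(void)
{
    return zeroed[0] + filled[0];
}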

Gabriel Laskar (EPITA) CAAL 2015 234 / 378 Object file formats Binary formats Linking 6: Object file formats

Build process

Binary formats Simple binary formats How about code reuse and splitting Linking File formats history

Executable code loading

Gabriel Laskar (EPITA) CAAL 2015 235 / 378 Object file formats Binary formats Linking Linking

The operation that combines a list of object programs into a binary program is called linking. The tasks that must be accomplished by the linker are:
1. Named sections from different object programs must be merged together into one named section.
2. Merged sections must be put together into the sections of the memory model.
3. Each use of a name in an object program's references list must be replaced by an address in the virtual address space.

Gabriel Laskar (EPITA) CAAL 2015 236 / 378 Object file formats Binary formats Linking Merging object files

Combining each .text section together, each .data section together, etc.:

[Diagram: merging object files — the .text sections of all input objects are concatenated into a single .text section; likewise for the .rodata and .data sections.]

Gabriel Laskar (EPITA) CAAL 2015 237 / 378 Object file formats Binary formats Linking Zoom on .text section merging

[Diagram: zoom on .text section merging — object A's .text section and object B's .text section are placed one after the other in the merged .text section; a symbol at offset S inside B's section ends up at offset B + S in the merged section, where B is the offset at which B's section was placed.]

Gabriel Laskar (EPITA) CAAL 2015 238 / 378 Object file formats Binary formats Linking Symbol resolution

raw binary format .text / .data

a.out format file header .text .data relocations symbols

Gabriel Laskar (EPITA) CAAL 2015 239 / 378 Object file formats Binary formats Linking Executable binary files

When there are no unresolved symbols nor relocations left, the file is executable. Executable files usually have a fixed, constant load address on a given system. Unresolved/unused symbols may exist in an executable file if no relocation uses them. The strip command wipes out unused symbols from the symbol table.

Gabriel Laskar (EPITA) CAAL 2015 240 / 378 Object file formats Binary formats File formats history 6: Object file formats

Build process

Binary formats Simple binary formats How about code reuse and splitting Linking File formats history

Executable code loading

Gabriel Laskar (EPITA) CAAL 2015 241 / 378 Object file formats Binary formats File formats history a.out: Assembler Output

raw binary format .text / .data

a.out format file header .text .data relocations symbols

I Very simple

I .. but pretty fast to load

Gabriel Laskar (EPITA) CAAL 2015 242 / 378 Object file formats Binary formats File formats history COFF (Common Object File Format) and ELF (Executable and Linking Format)

COFF format file section syms .text relocs .data header table

ELF format file section relocs symbols .text .data header table

I Permits use of dynamic libraries
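As an illustration of the ELF layout, the file header can be read with the structures from <elf.h> (a sketch assuming a 32-bit ELF file passed on the command line, on a system providing <elf.h>):

#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    Elf32_Ehdr ehdr;
    FILE *f;

    if (argc < 2 || !(f = fopen(argv[1], "rb")))
        return 1;
    if (fread(&ehdr, sizeof(ehdr), 1, f) != 1)
        return 1;

    /* the file header locates every other table in the file */
    printf("machine:              %u\n", (unsigned)ehdr.e_machine);
    printf("entry point:          0x%x\n", (unsigned)ehdr.e_entry);
    printf("section header table: offset 0x%x, %u entries\n",
           (unsigned)ehdr.e_shoff, (unsigned)ehdr.e_shnum);

    fclose(f);
    return 0;
}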

Gabriel Laskar (EPITA) CAAL 2015 243 / 378 Object file formats Binary formats File formats history Mach-O (Darwin)

file load text rodata data header commands

Gabriel Laskar (EPITA) CAAL 2015 244 / 378 Object file formats Binary formats File formats history Mach-O (Darwin) “Fat binaries” Mach−O fat binaries Fat header

Fat arch table

file load text rodata data header commands

file load text rodata data header commands

Gabriel Laskar (EPITA) CAAL 2015 245 / 378 Object file formats Executable code loading 6: Object file formats

Build process

Binary formats

Executable code loading Binary file loading Loading process Dynamic libraries

Gabriel Laskar (EPITA) CAAL 2015 246 / 378 Object file formats Executable code loading Binary file loading 6: Object file formats

Build process

Binary formats

Executable code loading Binary file loading Loading process Dynamic libraries

Gabriel Laskar (EPITA) CAAL 2015 247 / 378 Object file formats Executable code loading Binary file loading Binary file loading

Executable file format is like object file format, with the following restrictions:

I It has no relocations,

I All addresses are resolved

I It is ready to load in memory

Gabriel Laskar (EPITA) CAAL 2015 248 / 378 Object file formats Executable code loading Binary file loading Executable format

[Diagram: executable file format — same layout as the object file format: headers (file header, section table, relocation table, symbol table, string table) and the object file sections .text, .rodata, .data, .bss (not in file).]

Gabriel Laskar (EPITA) CAAL 2015 249 / 378 Object file formats Executable code loading Binary file loading Executable loading

Process memory structure is mapped to internal executable file format:

I Loadable sections are loaded from file to memory

I Uninitialised data (.bss section) is allocated in process memory

I Internal file sections are ignored

I Heap and stack are allocated

Gabriel Laskar (EPITA) CAAL 2015 250 / 378 Object file formats Executable code loading Loading process 6: Object file formats

Build process

Binary formats

Executable code loading Binary file loading Loading process Dynamic libraries

Gabriel Laskar (EPITA) CAAL 2015 251 / 378 Object file formats Executable code loading Loading process Executable memory mapping

[Diagram: executable memory mapping — the file's .text, .rodata and .data sections are loaded into the initialized part of process memory; .bss (not in the file), the heap and the stack are allocated as non-initialized memory; the section table, relocation table, symbol table and string table stay in the object file headers and are not mapped.]

Gabriel Laskar (EPITA) CAAL 2015 252 / 378 Object file formats Executable code loading Loading process Memory attributes

Memory attributes depend on the section and area type:

I the .text section may be loaded in an executable-only region (depending on the OS)

I memory used to store .rodata section will be marked as read-only

I memory used to store .data, .bss sections, heap and stack will be marked as read/write

Gabriel Laskar (EPITA) CAAL 2015 253 / 378 Object file formats Executable code loading Loading process Memory attributes

[Diagram: memory attributes — .text pages are mapped −−x (executable) and .rodata pages r−− (read only), both loaded from the file; .data, .bss, heap and stack pages are rw− (read/write); internal file data such as the symbol and string tables is not mapped.]

Gabriel Laskar (EPITA) CAAL 2015 254 / 378 Object file formats Executable code loading Dynamic libraries 6: Object file formats

Build process

Binary formats

Executable code loading Binary file loading Loading process Dynamic libraries

Gabriel Laskar (EPITA) CAAL 2015 255 / 378 Object file formats Executable code loading Dynamic libraries Dynamic libraries

Use of dynamic libraries implies:

I different and more complex memory mapping

I position independent code

I different way to handle relocations

I plenty of cool complicated stuff

Gabriel Laskar (EPITA) CAAL 2015 256 / 378 Object file formats Executable code loading Dynamic libraries Dynamic libs in memory

Libraries have specific memory mapping and organisation:

I A dynamic library is loaded only once even if several processes use it

I Extra .data and .text sections are mapped in the process memory space

I The library .text section is mapped in several process memories

I The library .data section is copied into each process memory

Gabriel Laskar (EPITA) CAAL 2015 257 / 378 Object file formats Executable code loading Dynamic libraries Dynamic process memory layout

[Diagram: dynamic process memory layout — process A maps executable A's .text, .rodata and .data; process B maps executable B's; the .text sections of libraries A and B are shared between both processes, while each process receives its own copy of the libraries' .data sections; each process also has its own heap and stack.]

Gabriel Laskar (EPITA) CAAL 2015 258 / 378 Object file formats Executable code loading Dynamic libraries Handling code location change

Dynamic libraries may not be mapped with the same virtual address in all processes:

I The address may already be in use by another library

I The same code must be able to run with different base addresses

Position independent code solves these problems without going through the complex relocation process. Relative jumps and relative data memory accesses need to be supported by the CPU.

Gabriel Laskar (EPITA) CAAL 2015 259 / 378 Object file formats Executable code loading Dynamic libraries Position independent code

[Diagram: position independent code — the executable's .text, .rodata and .data are mapped into the process together with a library's .text, the heap and the stack; inside .text, a relative branch (jmp +12 at address 0100 reaching the mov at 0112) keeps working wherever the section is mapped.]

Gabriel Laskar (EPITA) CAAL 2015 260 / 378 Object file formats Executable code loading Dynamic libraries Position dependent code

Some location changes are more complicated with dynamic libraries:

I .data can't be referred to with relative addressing from the .text section if no relative data access is available.

I The .data section won't have the same address in all process memory maps.

I The .text section is common to all processes and can't reflect .data location differences.

The solution is to use indirect memory accesses to reach data objects from code. One GOT (Global Offset Table) and PLT (Procedure Linkage Table) per process will hold data object addresses. Some dynamic relocations and data copies will also help to solve this problem.
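The idea can be sketched in plain C (purely illustrative, not the real GOT/PLT machinery): shared code reaches data objects only through a per-process table of pointers, so the code itself never embeds an absolute data address:

#include <stdio.h>

static int message = 42;              /* per-process copy of a .data object      */
static int *got[] = { &message };     /* hypothetical per-process pointer table  */

/* "shared" code: it only uses an index into the table, never a fixed address */
static int read_message(void)
{
    return *got[0];
}

int main(void)
{
    printf("%d\n", read_message());
    return 0;
}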

Gabriel Laskar (EPITA) CAAL 2015 261 / 378 Object file formats Executable code loading Dynamic libraries Use of GOT/PLT

[Diagram: use of the GOT/PLT — a call printf in the executable's .text goes through the GOT/PLT mapped in process A's data area, which holds the actual address of printf in the library's .text as mapped in that process.]

Gabriel Laskar (EPITA) CAAL 2015 262 / 378 Assembly programming

Part VII

Assembly programming

Gabriel Laskar (EPITA) CAAL 2015 263 / 378 Assembly programming Introduction 7: Assembly programming

Introduction

Pure assembly files

Inline Assembly in C

Gabriel Laskar (EPITA) CAAL 2015 264 / 378 Assembly programming Introduction What is assembly?

Assembly is not

I binary data

I directly interpreted by the processor

I don't mix it up with machine code

Assembly is

I a language, with a syntax, keywords, etc.

I translated in a binary format

Gabriel Laskar (EPITA) CAAL 2015 265 / 378 Assembly programming Introduction Benefits of assembly programming

I Improve developer's understanding of and programming

I Direct access to hardware resources

I Access to system features

I Speed and efficiency of programs

I Low footprint of programs

Gabriel Laskar (EPITA) CAAL 2015 266 / 378 Assembly programming Introduction Drawbacks of assembly programming

I Low-level programming forbids abstraction

I Assembly code is hard to keep clean

I Slows development down: lots of keywords, unusual behaviors

I Each processor architecture has its own language: conventions, registers, instructions, privileges, allowed arguments

Gabriel Laskar (EPITA) CAAL 2015 267 / 378 Assembly programming Pure assembly files 7: Assembly programming

Introduction

Pure assembly files Anatomy of assembly code Examples

Inline Assembly in C

Gabriel Laskar (EPITA) CAAL 2015 268 / 378 Assembly programming Pure assembly files Anatomy of assembly code 7: Assembly programming

Introduction

Pure assembly files Anatomy of assembly code Examples

Inline Assembly in C

Gabriel Laskar (EPITA) CAAL 2015 269 / 378 Assembly programming Pure assembly files Anatomy of assembly code Assembly file organization

Source file is organized in different sections:
code     Contains the program binary opcodes
data     Contains the program global variables
rodata   Contains read-only program variables

Gabriel Laskar (EPITA) CAAL 2015 270 / 378 Assembly programming Pure assembly files Anatomy of assembly code Assembly language statements

directives     Determine the assembler behavior
instructions   Chosen from the CPU instruction set, translated into binary format, written to the output file
operands       Expressions containing register identifiers, immediate operands, symbols
labels         Between instructions, used to mark specific locations in the source code

Gabriel Laskar (EPITA) CAAL 2015 271 / 378 Assembly programming Pure assembly files Examples 7: Assembly programming

Introduction

Pure assembly files Anatomy of assembly code Examples

Inline Assembly in C

Gabriel Laskar (EPITA) CAAL 2015 272 / 378 Assembly programming Pure assembly files Examples Assembly language statements Example

.mod_load asm-sparc
.define FOO 5

.section code .text
.mod_asm opcodes

    add %g0, FOO, %g1
.ends

Gabriel Laskar (EPITA) CAAL 2015 273 / 378 Assembly programming Pure assembly files Examples Assembly language statements Example

.mod_load asm-sparc
.mod_load out-elf32

.section code .text
.mod_asm opcodes v8

    mov 0x12, %g1
    mov 0x34533, %g2
    add %g1, %g2, %g3
.ends

Gabriel Laskar (EPITA) CAAL 2015 274 / 378 Assembly programming Pure assembly files Examples Assembly language statements Example

.mod_load asm-sparc
.include sparc/v8.def

.section data .data
lbl: .reserve 4
.ends

.section code .text
.mod_asm opcodes v8

    @set .data:lbl, %g2
    ld [%g2], %g1
.ends

Gabriel Laskar (EPITA) CAAL 2015 275 / 378 Assembly programming Pure assembly files Examples Assembly language statements Example

.mod_load asm-sparc
.include sparc/v8.def

.section code .text
.mod_asm opcodes v8

.export main
.proc main
    ret
    restore
.endp
.ends

Gabriel Laskar (EPITA) CAAL 2015 276 / 378 Assembly programming Pure assembly files Examples Assembly language statements Example

.mod_load asm-sparc
.include sparc/v8.def

.section code .text
.mod_asm opcodes v8

.extern exit

.proc my_exit
    call exit
    nop
.endp
.ends

Gabriel Laskar (EPITA) CAAL 2015 277 / 378 Assembly programming Inline Assembly in C 7: Assembly programming

Introduction

Pure assembly files

Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz

Gabriel Laskar (EPITA) CAAL 2015 278 / 378 Assembly programming Inline Assembly in C Presentation 7: Assembly programming

Introduction

Pure assembly files

Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz

Gabriel Laskar (EPITA) CAAL 2015 279 / 378 Assembly programming Inline Assembly in C Presentation Why?

I C can’t express everything

  I CPU-specific registers (flags, condition codes, hardware counters)
  I Features not addressed by the language (atomic operations)
  I System features (memory handling, interrupt handling, exception handling)
I Compilers are not always aware of possible optimizations

  I Use of instruction side-effects
  I Un-optimizable patterns (or optimization patterns specific to only one)

Gabriel Laskar (EPITA) CAAL 2015 280 / 378 Assembly programming Inline Assembly in C Presentation Inline Assembly Compiler’s PoV

For the compiler, the assembly you pass in is:

I a raw string

I with a printf-like syntax

I passed directly to the assembler
I surrounded by optional statements:

  I input values
  I output values
  I clobbered values

Gabriel Laskar (EPITA) CAAL 2015 281 / 378 Assembly programming Inline Assembly in C Presentation Inline Assembly Programmer’s PoV

For you, the assembly you pass is meant to either

I compute a value

I change some hardware feature

I have a side-effect

Gabriel Laskar (EPITA) CAAL 2015 282 / 378 Assembly programming Inline Assembly in C Simple examples 7: Assembly programming

Introduction

Pure assembly files

Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz

Gabriel Laskar (EPITA) CAAL 2015 283 / 378 Assembly programming Inline Assembly in C Simple examples Example No data

static inline void disable_interrupts(void) { asm volatile("cli"); }

Gabriel Laskar (EPITA) CAAL 2015 284 / 378 Assembly programming Inline Assembly in C Simple examples Example Output only

static inline cpu_cycle_t cpu_cycle_count(void)
{
    uint32_t low, high;

    asm("rdtsc" : "=a" (low), "=d" (high));

    return (low | ((uint64_t)high << 32));
}

Gabriel Laskar (EPITA) CAAL 2015 285 / 378 Assembly programming Inline Assembly in C Simple examples Example Input only

static inline void cpu_io_write_8(uintptr_t addr, uint8_t data)
{
    asm volatile("outb %0, %1"
                 : : "a" (data), "d" ((uint16_t)addr));
}

Gabriel Laskar (EPITA) CAAL 2015 286 / 378 Assembly programming Inline Assembly in C Simple examples Example In-out

static inline uint32_t cpu_endian_swap32(uint32_t x)
{
    asm("bswap %0" : "=r" (x) : "0" (x));

    return x;
}

Gabriel Laskar (EPITA) CAAL 2015 287 / 378 Assembly programming Inline Assembly in C Syntax 7: Assembly programming

Introduction

Pure assembly files

Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz

Gabriel Laskar (EPITA) CAAL 2015 288 / 378 Assembly programming Inline Assembly in C Syntax Syntax

I one statement with optional arguments:

asm [volatile]("statements    \n\t"
               "on many lines \n\t"
    [: [output_variables]
    [: [input_variables]
    [: clobbered_registers]]]);

I arguments are referenced in-order (%0..%n), whether they are input or output!

I argument types abide by constraints, enclosed in quotes ("")

Gabriel Laskar (EPITA) CAAL 2015 289 / 378 Assembly programming Inline Assembly in C Complex examples 7: Assembly programming

Introduction

Pure assembly files

Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz

Gabriel Laskar (EPITA) CAAL 2015 290 / 378 Assembly programming Inline Assembly in C Complex examples Add with carry and overflow C

uint32_t add_cv(uint32_t a, uint32_t b, uint8_t cin,
                uint8_t *cout, uint8_t *vout)
{
    uint64_t sum = (uint64_t)a + (uint64_t)b + (uint64_t)cin;
    *cout = sum >> 32;
    *vout = ((b ^ ((uint32_t)sum)) & ~(a ^ b)) >> 31;
    return sum;
}

Gabriel Laskar (EPITA) CAAL 2015 291 / 378 Assembly programming Inline Assembly in C Complex examples Add with carry and overflow ASM

uint32_t add_cv(uint32_t a, uint32_t b, uint8_t cin,
                uint8_t *cout, uint8_t *vout)
{
    uint32_t result;

    asm("btl  $0, %k5  \n"
        "adcl %k3, %k4 \n"
        "setc %b1      \n"
        "seto %b2      \n"
        : "=r" (result), "=qm" (*cout), "=qm" (*vout)
        : "r" (a), "0" (b), "r" (cin));

    return result;
}

Gabriel Laskar (EPITA) CAAL 2015 292 / 378 Assembly programming Inline Assembly in C Complex examples Atomic increment for ARM

static inline bool_t cpu_atomic_inc(volatile atomic_int_t *a)
{
    reg_t tmp = 0, tmp2;

    asm volatile("1:                 \n\t"
                 "ldrex %0, [%2]     \n\t"
                 "add   %0, %0, #1   \n\t"
                 "strex %1, %0, [%2] \n\t"
                 "tst   %1, #1       \n\t"
                 "bne   1b           \n\t"
                 : "=&r" (tmp), "=&r" (tmp2)
                 : "r" (a)
                 : "m" (*a));

    return tmp != 0;
}

Gabriel Laskar (EPITA) CAAL 2015 293 / 378 Assembly programming Inline Assembly in C Named values 7: Assembly programming

Introduction

Pure assembly files

Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz

Gabriel Laskar (EPITA) CAAL 2015 294 / 378 Assembly programming Inline Assembly in C Named values Named values

I Using numbers for values makes code harder to write and read

I GCC permits named values in inline assembly

Gabriel Laskar (EPITA) CAAL 2015 295 / 378 Assembly programming Inline Assembly in C Named values Named values With numbers static inline bool_t cpu_atomic_inc(volatile atomic_int_t *a) { reg_t tmp=0, tmp2;

asm volatile("1: \n\t" "ldrex %0, [%2] \n\t" "add %0, %0, #1 \n\t" "strex %1, %0, [%2] \n\t" "tst %1, #1 \n\t" "bne 1b \n\t" : "=&r" (tmp), "=&r" (tmp2) : "r" (a) : "m"(*a) );

Gabriel Laskar (EPITA) CAAL 2015 296 / 378 return tmp !=0; } Assembly programming Inline Assembly in C Named values Named values Named static inline bool_t cpu_atomic_inc(volatile atomic_int_t *a) { reg_t tmp=0, tmp2;

asm volatile("1: \n\t" "ldrex %[tmp], [%[atomic]] \n\t" "add %[tmp], %[tmp], #1 \n\t" "strex %[tmp2], %[tmp], [%[atomic]] \n\t" "tst %[tmp2], #1 \n\t" "bne 1b \n\t" : [tmp] "=&r" (tmp), [tmp2] "=&r" (tmp2) , [clobber] "=m"(*a) : [atomic] "r" (a) );

Gabriel Laskar (EPITA) CAAL 2015 297 / 378 return tmp !=0; } Assembly programming Inline Assembly in C Quizz 7: Assembly programming

Introduction

Pure assembly files

Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz

Gabriel Laskar (EPITA) CAAL 2015 298 / 378 Assembly programming Inline Assembly in C Quizz Quizz! What does this do?

asm volatile("":::"memory");

Gabriel Laskar (EPITA) CAAL 2015 299 / 378 Focus on x86

Part VIII

Focus on x86

Gabriel Laskar (EPITA) CAAL 2015 300 / 378 Focus on x86 x86 story 8: Focus on x86

x86 story Timeline x86 architecture Instruction set

Gabriel Laskar (EPITA) CAAL 2015 301 / 378 Focus on x86 x86 story Timeline 8: Focus on x86

x86 story Timeline x86 architecture Instruction set

Gabriel Laskar (EPITA) CAAL 2015 302 / 378 Focus on x86 x86 story Timeline Early events and CPU development

1971 Intel 4004
1978 Intel 8086 & 8088, first x86 CPUs
1981 Intel 80186
1982 Intel 80286: 16 bits, 24 address bits, protection

Gabriel Laskar (EPITA) CAAL 2015 303 / 378 Focus on x86 x86 story Timeline Successful CPU development

1985 Intel 80386: 32 bits, MMU
1989 Intel 80486: on-chip cache, pipeline
1993 Intel Pentium: superscalar, 64-bit data bus
1995 Intel Pentium Pro: ooo
1997 Intel Pentium II: internal L2
1999 Intel Pentium III: SSE!
1999 AMD Athlon: ooo, 3dnow

Gabriel Laskar (EPITA) CAAL 2015 304 / 378 Focus on x86 x86 story Timeline Technology barriers

2000 Intel Pentium 4: HT
2001 Intel Itanium
2003 AMD Athlon 64
2003 Intel Pentium M: P6 (PPro to PIII)
2006 Intel Pentium 4 Prescott 2M: 64 bits
2006 Intel Core/Core2: fusion
2009 Intel Core i5/i7: ...

Gabriel Laskar (EPITA) CAAL 2015 305 / 378 Focus on x86 x86 story x86 architecture 8: Focus on x86

x86 story Timeline x86 architecture Instruction set

Gabriel Laskar (EPITA) CAAL 2015 306 / 378 Focus on x86 x86 story x86 architecture x86 architecture

The x86 architecture is:

I based on a CISC instruction set.

I based on little-endian memory access.

I based on two-operand instructions, including memory accesses.

I backwards compatible: all new x86 processors are fully compatible with their predecessors.

I very complex: the Pentium IV has a 20-stage pipeline.

I newer processors have fewer pipeline stages

Gabriel Laskar (EPITA) CAAL 2015 307 / 378 Focus on x86 x86 story x86 architecture Available registers

First x86 generation had a few 16-bit registers available:

I General purpose 16-bit registers: ax, bx, cx, dx, si, di

I General purpose 8-bit registers: al, bl, cl, dl, ah, bh, ch, dh

I Stack and frame registers: sp, bp

I Flag registers, ...

Gabriel Laskar (EPITA) CAAL 2015 308 / 378 Focus on x86 x86 story x86 architecture Nested registers

AX: 16-bit register

AH | AL: 8-bit registers

Gabriel Laskar (EPITA) CAAL 2015 309 / 378 Focus on x86 x86 story x86 architecture Register extensions

I Starting with the 386 processor, all registers are 32 bits wide.

I Due to compatibility issues, the 16-bit registers are still present.
I eax, ebx, ecx, edx, esi, edi, esp, ebp were added.
I Even if the register size increased, the register count remained the same: only 6 registers are available for operations.
I For amd64, AMD extended the register count to 16, only available in 64-bit mode

I rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8, r9, r10, r11, r12, r13, r14, r15

I AMD licensed amd64 to Intel after the Itanium flop...

Gabriel Laskar (EPITA) CAAL 2015 310 / 378 Focus on x86 x86 story x86 architecture 32/64 bits nested registers

RAX: 64-bit register
EAX: 32-bit register
AX: 16-bit register
AH | AL: 8-bit registers

Gabriel Laskar (EPITA) CAAL 2015 311 / 378 Focus on x86 x86 story x86 architecture Instruction format

The x86 architecture is a CISC-based architecture:

I All instructions have a variable length,

I Instruction length can’t be determined before reading the first bytes,

I Many instructions with different formats are available.

Gabriel Laskar (EPITA) CAAL 2015 312 / 378 Focus on x86 x86 story x86 architecture Instruction format

8086-80286

opcode | opc2 | mod memory r/m | displacement | immediate

Gabriel Laskar (EPITA) CAAL 2015 313 / 378 Focus on x86 x86 story x86 architecture Instruction format

80386-Pentium

data size prefix | address size prefix

opcode prefix | opcode | opc2 | mod memory r/m | SIB | displacement | immediate

Gabriel Laskar (EPITA) CAAL 2015 313 / 378 Focus on x86 x86 story x86 architecture Instruction format

3dnow!

data size prefix | address size prefix

opcode prefix | opcode | opc2 | mod memory r/m | SIB | displacement | immediate

imm (3dnow)

Gabriel Laskar (EPITA) CAAL 2015 313 / 378 Focus on x86 x86 story x86 architecture Instruction format

x86-64

x86-64 prefix | data size prefix | address size prefix

opcode prefix | opcode | opc2 | mod memory r/m | SIB | displacement | immediate

imm (3dnow)

Gabriel Laskar (EPITA) CAAL 2015 313 / 378 Focus on x86 x86 story x86 architecture

The x86 architecture supports several complex addressing modes. On 32-bit processors (386 and later) a memory address can contain:

I A base address register

I An address displacement

I An index register

I A multiply factor on the index register

Example: mov eax, [ebx + 0x12345 + ecx * 8]
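As a hedged illustration of where such an addressing mode comes from (hypothetical types and names), indexing an array of 8-byte structures can be folded into a single load:

#include <stdint.h>

struct pair { uint32_t key; uint32_t value; };   /* 8 bytes per element */

/* Reading base[i].value needs base + i*8 + 4: a base register, an index
 * register scaled by 8, and a constant displacement of 4, so a compiler
 * can emit one mov with the complex addressing mode shown above. */
uint32_t get_value(const struct pair *base, long i)
{
    return base[i].value;
}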

Gabriel Laskar (EPITA) CAAL 2015 314 / 378 Focus on x86 x86 story x86 architecture Wired stack management

Specific instructions are available for stack management:

I The push x instruction stores data on top of the stack and decrements the stack pointer.

I The pop x instruction removes data from the stack.

I The call and ret instructions directly push and pop the return address to and from the stack. The (e)sp register is used as the stack pointer.
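A small hedged sketch (32-bit x86, GCC inline asm; stack_swap is a hypothetical helper): exchanging two values through the stack makes the push/pop behaviour visible:

/* push stores to [esp] and decrements esp; pop loads from [esp] and
 * increments it.  Pushing x then y and popping in the same order swaps
 * the two values. */
static inline void stack_swap(unsigned *a, unsigned *b)
{
    unsigned x = *a, y = *b;

    asm ("pushl %0 \n\t"     /* esp -= 4; [esp] = x             */
         "pushl %1 \n\t"     /* esp -= 4; [esp] = y             */
         "popl  %0 \n\t"     /* x = [esp]; esp += 4  (x gets y) */
         "popl  %1 \n\t"     /* y = [esp]; esp += 4  (y gets x) */
         : "+r" (x), "+r" (y));

    *a = x;
    *b = y;
}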

Gabriel Laskar (EPITA) CAAL 2015 315 / 378 Focus on x86 x86 story x86 architecture Branch

I The x86 architecture uses a flag register to manage conditional branches.

I Many different instructions with different lengths can be used depending on the jump size.

I No delay slots are used
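A minimal hedged sketch of the flags mechanism (x86, GCC inline asm; is_less is a hypothetical helper): cmp only updates the flag register, and a later conditional instruction consumes it:

static inline int is_less(int a, int b)
{
    int r;

    asm ("cmpl   %2, %1 \n\t"   /* compute a - b, keep only the flags      */
         "setl   %b0    \n\t"   /* low byte of r = 1 if a < b (signed)     */
         "movzbl %b0, %0"       /* zero-extend that byte into the full reg */
         : "=q" (r)
         : "r" (a), "r" (b)
         : "cc");

    return r;
}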

Gabriel Laskar (EPITA) CAAL 2015 316 / 378 Focus on x86 x86 story x86 architecture Jump size

long jump:  opcode E9, 16/32-bit immediate (−32768 to +32767)
short jump: opcode EB, 8-bit immediate (−128 to +127)

Gabriel Laskar (EPITA) CAAL 2015 317 / 378 Focus on x86 x86 story Instruction set 8: Focus on x86

x86 story Timeline x86 architecture Instruction set

Gabriel Laskar (EPITA) CAAL 2015 318 / 378 Focus on x86 x86 story Instruction set Instruction set

x86 instruction set is huge:

I Over 580 instructions handled by Pentium 3 CPU

I Over 850 opcodes handled by Pentium 3 CPU

A single instruction can be very complex:

I Memory access capable

I Handle different data widths

I Perform complex computation

Gabriel Laskar (EPITA) CAAL 2015 319 / 378 Focus on x86 x86 story Instruction set Instruction set extensions

Extensions to the x86 instruction set are common, and not all instructions are available on all CPUs. Starting with the Pentium processor, SIMD-based instruction sets appeared:

I MMX

I 3dNow!

I MMX2, SSE

I 3dNow2!, SSE2, SSE3, SSE4s

I To be continued...

Gabriel Laskar (EPITA) CAAL 2015 320 / 378 Focus on x86 x86 story Instruction set Instruction set SIMD operations

              normal operation   SIMD operation
source 1      64-bit value       4 x 16-bit register
source 2      64-bit value       4 x 16-bit register
destination   64-bit value       4 x 16-bit register
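A small hedged illustration with SSE2 intrinsics (128-bit registers, so eight 16-bit lanes instead of the four shown in the table; compile with SSE2 enabled):

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* One vector instruction (paddw) adds all eight 16-bit lanes at once. */
void add8_u16(uint16_t dst[8], const uint16_t a[8], const uint16_t b[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    _mm_storeu_si128((__m128i *)dst, _mm_add_epi16(va, vb));
}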

Gabriel Laskar (EPITA) CAAL 2015 321 / 378 Focus on x86 x86 story Instruction set x86-specific coding

Due to its amazing number of available instructions, x86 code is highly tunable for optimizations:

I High-level language compilers are not able to use all instructions efficiently,

I Hand-written assembly is often faster,

I Complex instructions can be used for unexpected tasks.

Gabriel Laskar (EPITA) CAAL 2015 322 / 378 Focus on x86 x86 story Instruction set Instruction set has-been parts

I aaa, aad, aam, aas, daa, das

I xlat

Gabriel Laskar (EPITA) CAAL 2015 323 / 378 Focus on x86 x86 story Instruction set x86-specific coding example: LEA

The LEA instruction is designed to compute memory addresses, but it can also be used to add and multiply values.

lea eax, [ebx + ecx]

lea eax, [ebx*8 + ebx]
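A hedged sketch of the trick (32-bit x86, GCC inline asm, AT&T syntax; times9 is a hypothetical helper):

#include <stdint.h>

/* lea computes base + index*scale without touching the flags,
 * so x*9 = x + x*8 fits in a single instruction. */
static inline uint32_t times9(uint32_t x)
{
    uint32_t r;

    asm ("leal (%1, %1, 8), %0"
         : "=r" (r)
         : "r" (x));

    return r;
}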

Gabriel Laskar (EPITA) CAAL 2015 324 / 378 Focus on x86 x86 story Instruction set x86-specific coding example: tiniest mem*()?

memcpy   rep movsb
memset   rep stosd
memcmp   repe cmpsb
strncmp  repnz cmpsb
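A hedged sketch of the idea (x86, GCC inline asm; tiny_memset is a hypothetical helper, not a libc replacement): rep stosb stores al to [(e/r)di], (e/r)cx times, so the whole loop fits in one instruction:

#include <stddef.h>

static inline void *tiny_memset(void *dst, int c, size_t n)
{
    void *d = dst;

    asm volatile ("rep stosb"
                  : "+D" (d), "+c" (n)   /* edi/rdi = destination, ecx/rcx = count */
                  : "a" (c)              /* al = byte value                        */
                  : "memory");

    return dst;
}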

Gabriel Laskar (EPITA) CAAL 2015 325 / 378 Focus on RISC processors

Part IX

Focus on RISC processors

Gabriel Laskar (EPITA) CAAL 2015 326 / 378 Focus on RISC processors History 9: Focus on RISC processors

History

Some instruction sets

Gabriel Laskar (EPITA) CAAL 2015 327 / 378 Focus on RISC processors History Let’s see a timeline...

[Timeline figure, 1965-2010, RISC processor families:]
I CDC 6600 superscalar (Cray); Data General Nova (Edson de Castro)
I ARM (Acorn, Cambridge): ARM project, ARM 1, ARM 2, ARM ltd, ARM 6 (v3), ARM 7, ARM 9-tdmi, ARM 11, Thumb, Cortex M3/A8/M1/M0/A9, A9 >2GHz; hard/soft core + licensing; Queen’s award; 3DO, GBA, DS
I SPARC (UC-Berkeley, Patterson, DARPA RISC project): RISC-I, RISC-II, SPARC, SPARC-v8, SPARC-v9, Ultra-1, Ultra-2, Ultra-3; open source
I Mips (Stanford, Hennessy, DARPA): Mips-r2000, r3000, r4000 (64-bit), r8k, r10k, r12k, r16k, r24k, Mips IV, Mips V, Mips32; soft-core + licensing; SGI, N64, PlayStation, PS2, PSP
I IBM POWER (A-I-M): America/Cheetah RISC project, RS-6000 (RIOS), RSC, PPC, POWER 3, POWER 4, POWER 5, POWER 7 (DARPA); 64 bits, hard core; GameCube, Wii, XBox 360, CELL (STI, PS3); Bellatrix (unsure)
I DEC: PRISM, Alpha EV4/EV5/EV6/EV7 (21064/21164/21264/21364), unsure
I Hitachi SuperH: SH-2, SH-3, SH-4, SH-6, SH-7 (announced); Saturn, Dreamcast; licensing issues
I HP PA-Risc: TS1, PCX-S, PCX-L, PCX-U, PCX-W+, Shortfin; PA-7000, PA-7100-LC, PA-8000, PA-8600, PA-8900; SIMD; unsure

Gabriel Laskar (EPITA) CAAL 2015 328 / 378 Focus on RISC processors Some instruction sets 9: Focus on RISC processors

History

Some instruction sets Mips SPARC PPC ARM

Gabriel Laskar (EPITA) CAAL 2015 329 / 378 Focus on RISC processors Some instruction sets Mips 9: Focus on RISC processors

History

Some instruction sets Mips SPARC PPC ARM

Gabriel Laskar (EPITA) CAAL 2015 330 / 378 Focus on RISC processors Some instruction sets Mips Mips Instruction format

(bits 0..31)
J/JAL   | offset
OP      | rs | rt | imm
Special | rs | rt | rd | ... | op

I 32 GPR

I hard-wired r0 to 0

I 32 FPU registers, all sizes aliased

I delay slot

Gabriel Laskar (EPITA) CAAL 2015 331 / 378 Focus on RISC processors Some instruction sets Mips Mips Asm code

addiu r2, r3, 0x42    ; alu reg / imm
or    r3, r4, r5      ; alu reg / reg

lui   r2, 0x1234      ; load upper immediate
ori   r2, r2, 0x5678  ; r2 = 0x12345678

lw    r4, 0x1234(r9)  ; load word from r9 + 0x1234

slt   r1, r2, r3      ; test if r2 < r3
bnez  r1, 2f          ; jump if true
nop                   ; delay slot

lbu   r2, (r3)        ; load byte, not sign extended

jal   foo             ; pc = foo; r31 = return address
nop

Gabriel Laskar (EPITA) CAAL 2015 332 / 378 Focus on RISC processors Some instruction sets SPARC 9: Focus on RISC processors

History

Some instruction sets Mips SPARC PPC ARM

Gabriel Laskar (EPITA) CAAL 2015 333 / 378 Focus on RISC processors Some instruction sets SPARC SPARC

Instruction format (bits 0..31)

01 | offset

00 | rd     | op2 | imm
00 | a cond | op2 | imm

1x | rd | op3 | rs1 | 0 | asi | rs2
1x | rd | op3 | rs1 | 1 | imm
1x | rd | op3 | rs1 | opf | rs2

I 32 GPR
I hard-wired %g0 to 0
I register window
I unwindowed aliased FPU registers, 8 * 128 bits, 32 * 32 bits
I delay slot

Gabriel Laskar (EPITA) CAAL 2015 334 / 378 Focus on RISC processors Some instruction sets SPARC SPARC Asm code

add   %g3, 0x42, %g2       ; alu reg / imm
or    %g3, %g4, %g5        ; alu reg / reg

sethi 0x12345, %g2         ; load upper immediate (20 bits)
or    %g2, 0x678, %g2      ; g2 = 0x12345678

ld    [%l2 + 0x1234], %g4  ; load word from l2 + 0x1234
ld    [%l2 + %i3], %g4     ; load word from l2 + i3

cmp   %l2, %l3             ; test if l2 < l3
blt   3f                   ; jump if true

Gabriel Laskar (EPITA) CAAL 2015 335 / 378 Focus on RISC processors Some instruction sets PPC 9: Focus on RISC processors

History

Some instruction sets Mips SPARC PPC ARM

Gabriel Laskar (EPITA) CAAL 2015 336 / 378 Focus on RISC processors Some instruction sets PPC PowerPC Instruction format

(bits 0..31)

op | jump offset | aa | lk
op | rd | ra | imm
op | rd | ra | rb | mb | me | rc
op | rd | ra | rb | op2 | rc

I 32 GPR

I 32 FPR, fixed internal representation

I additional registers for lr, ctr

I no global flags, but 8 “cause registers” crx
I can merge causes with logical operations

Gabriel Laskar (EPITA) CAAL 2015 337 / 378 Focus on RISC processors Some instruction sets PPC PowerPC Asm code

addi  3, 2, 42       ; alu reg / imm
or.   3, 4, 5        ; alu reg / reg, update cr0

lis   3, 0x1234      ; load upper immediate (16 bits)
ori   3, 3, 0x5678   ; r3 = 0x12345678

lwz   4, 9, 0x1234   ; load word from r9 + 0x1234
lwzx  4, 9, 3        ; load word from r9 + r3

mtctr 4              ; put r4 in count register
...
bdnzl 2f             ; Decrement CTR, Branch if CTR != 0

rlwinm ...           ; Rotate and mask

Gabriel Laskar (EPITA) CAAL 2015 338 / 378 Focus on RISC processors Some instruction sets ARM 9: Focus on RISC processors

History

Some instruction sets Mips SPARC PPC ARM

Gabriel Laskar (EPITA) CAAL 2015 339 / 378 Focus on RISC processors Some instruction sets ARM ARM Instruction features

I 16 GPR, one of them is pc (r15)
I Every instruction is “guarded” by a condition. It may be:
I always, >, >=, “higher”, overflow, negative, carry, ==
I their opposites
I aliased FPR: 16 double, 32 float

You can prefix any instruction with “never”:

I 1/16th of the instruction set is “nop”...

Gabriel Laskar (EPITA) CAAL 2015 340 / 378 Focus on RISC processors Some instruction sets ARM ARM Instruction format

(bits 0..31)
cond | 00  | i | op | s | rn | rd | imm/reg
cond | 01  | i | p u bw l | rn | rd | imm/reg
cond | 100 | p u bw l | rn | reg-list
cond | 101 | l | jmp offset
cond | 11  | coproc, sys

Gabriel Laskar (EPITA) CAAL 2015 341 / 378 and r3, r7, #42 ; r3 = r7 & 0x2a add r2, r8, r5, lsl r1 ; r2 = r8+(r5 << r1) sub r1, r9, r1, lsl #1 ; r1 = r9 - (r1 << 1)

012345678910111213141516171819202122232425262728293031 1110 00 1 and 0 r7 r3 « 0 0x2a 1110 00 0 add 0 r8 r2 r1 0 lsl 1 r5 1110 00 0 sub 0 r9 r1 0x1 lsl 0 r1

Focus on RISC processors Some instruction sets ARM ARM Instruction format (2) 012345678910111213141516171819202122232425262728293031 cond 00 i op s rn rd imm/reg cond 01 i p u bw l rn rd imm/reg 1 rotate imm 0 rs 0 ty 1 rm 0 amount ty 0 rm

Gabriel Laskar (EPITA) CAAL 2015 342 / 378 Focus on RISC processors Some instruction sets ARM ARM Instruction format (2) 012345678910111213141516171819202122232425262728293031 cond 00 i op s rn rd imm/reg cond 01 i p u bw l rn rd imm/reg 1 rotate imm 0 rs 0 ty 1 rm 0 amount ty 0 rm

and r3, r7, #42 ; r3 = r7 & 0x2a add r2, r8, r5, lsl r1 ; r2 = r8+(r5 << r1) sub r1, r9, r1, lsl #1 ; r1 = r9 - (r1 << 1)

(bits 0..31)
1110 | 00 | 1 | and | 0 | r7 | r3 | « 0 | 0x2a
1110 | 00 | 0 | add | 0 | r8 | r2 | r1 0 lsl 1 | r5
1110 | 00 | 0 | sub | 0 | r9 | r1 | 0x1 lsl 0 | r1

Gabriel Laskar (EPITA) CAAL 2015 342 / 378 Focus on RISC processors Some instruction sets ARM ARM Asm code

add   r3, r2, #0x42         ; alu reg / imm
orr   r3, r4, r5            ; alu reg / reg
bic   r3, r4, r5, lsr #4    ; alu reg / reg & shift

cmp   r2, #0                ; test if r2 < 0
rsblt r2, r2, #0            ; then r2 = 0 - r2

ldr   r2, #0x12345678       ; rewritten as
ldr   r2, [pc, #offset]     ; pc-relative load

streq r4, [r2, r3, lsl #4]  ; store word to r2 + r3 << 4
                            ; only if last cmp is ‘eq’
str   r4, [r2, #8]!         ; store word to r2 + 8
                            ; and r2 = r2 + 8
offset:                     ; 32-bit immediates zone
    0x12345678

Gabriel Laskar (EPITA) CAAL 2015 343 / 378 Focus on RISC processors Some instruction sets ARM ARM Block transfers

(bits 0..31)
cond | 100 | p u bw l | rn | reg-list

stmdb sp!, {r2, r3, r4, r8}
push {r2, r3, r4, r8}

(bits 0..31)
cond | 100 | 1 0 0 1 0 | r13 | 0000000100011100

Gabriel Laskar (EPITA) CAAL 2015 344 / 378 Focus on RISC processors Some instruction sets ARM ARM Block transfers hacks (1)

The link register is r14, so the following code saves r14 and restores it into pc afterwards...

push {r2, r3, r4, r8, lr}
...
pop  {r2, r3, r4, r8, pc}

Stack layout:
..00
..04  sp →     r2
..08           r3
..0c           r4
..10           r8
..14           lr
..18  old sp → xxx
..1c
..20

Gabriel Laskar (EPITA) CAAL 2015 345 / 378 Focus on RISC processors Some instruction sets ARM ARM Block transfers hacks (2)

Do a memcpy with some free registers: for instance, copy 16 bytes from *r0 to *r1, not using r2 nor r5, and update r0 and r1 to point to the next block...

ldmia r0!, {r3, r4, r6, r7}
stmia r1!, {r3, r4, r6, r7}

Gabriel Laskar (EPITA) CAAL 2015 346 / 378 Focus on RISC processors Some instruction sets ARM ARM Instruction-space exhaustion

Eventually, the ARM instruction set continued to grow despite the obvious exhaustion of the “clean” instruction-set space.

I abuse of concatenation of seldom-used bits around the instruction word

I sometimes forbid r15 as an operand, and reuse other fields

I reuse the “never” condition code

Gabriel Laskar (EPITA) CAAL 2015 347 / 378 cond 00 0 000 a s rn rd rs 1001 rm

Focus on RISC processors Some instruction sets ARM ARM Instruction-space exhaustion Illustration

012345678910111213141516171819202122232425262728293031 cond 00 I op s rn rd op2 1 rotate imm 0 rs 0 ty 1 rm 0 amount ty 0 rm

How to add the multiplication?

Gabriel Laskar (EPITA) CAAL 2015 348 / 378 Focus on RISC processors Some instruction sets ARM ARM Instruction-space exhaustion Illustration

012345678910111213141516171819202122232425262728293031 cond 00 I op s rn rd op2 1 rotate imm 0 rs 0 ty 1 rm 0 amount ty 0 rm

How to add the multiplication?

cond 00 0 000 a s rn rd rs 1001 rm

Gabriel Laskar (EPITA) CAAL 2015 348 / 378 Focus on RISC processors Some instruction sets ARM Thumb mode

One goal: Code compression. Concessions to pack a maximum of operations in a minimal space:

I Access to only 8 registers among 16

I Implicit stack ops

I Implicit pc-relative operations

I No predicates any more

Dirty hacks for making it work:

I Use lower bit of r15 as a mode indicator

I Use a new bx/blx instruction

Gabriel Laskar (EPITA) CAAL 2015 349 / 378 Focus on RISC processors Some instruction sets ARM Thumb2

Keep up with Thumb, this is great. Ideas:

I Keep assembly-source compatibility with ARM

I Remap the whole ARM 32-bit instruction set into 16-bit Thumb

I 16-bit instruction words are not enough any more. Never mind:

I Keep the basic 16-bit Thumb encoding

I When we need more, just make the instruction 32 bits long...

I Decode mixed 16/32-bit instructions

I Only a 16-bit alignment constraint for 32-bit Thumb-2 ops

Then Thumb-2 is a RISC with a CISC encoding!

Gabriel Laskar (EPITA) CAAL 2015 350 / 378 CPU-aware optimizations

Part X

CPU-aware optimizations

Gabriel Laskar (EPITA) CAAL 2015 351 / 378 CPU-aware optimizations Rationale 10: CPU-aware optimizations

Rationale

Examples

Code readability guidelines

Gabriel Laskar (EPITA) CAAL 2015 352 / 378 CPU-aware optimizations Rationale Purpose

Knowing the low-level implementation of the processors, we can:

I write code that is easier for the compiler to translate
I write code that is easier for the CPU to run

I because it is nicer to the pipeline
I because it is nicer to the memory subsystem

Gabriel Laskar (EPITA) CAAL 2015 353 / 378 CPU-aware optimizations Examples 10: CPU-aware optimizations

Rationale

Examples abs() Sign extension Power of two detection Mask merging

Code readability guidelines

Gabriel Laskar (EPITA) CAAL 2015 354 / 378 CPU-aware optimizations Examples abs() 10: CPU-aware optimizations

Rationale

Examples abs() Sign extension Power of two detection Mask merging

Code readability guidelines

Gabriel Laskar (EPITA) CAAL 2015 355 / 378 CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs()

The absolute value is a good example of optimized code which is easy to implement in an efficient and smart way.

Gabriel Laskar (EPITA) CAAL 2015 356 / 378 CPU-aware optimizations Examples abs() Nicer with the pipeline abs() basic code

int abs(int x)
{
    if (x < 0)
        return -x;
    else
        return x;
}

int abs(int x)
{
    return (x < 0) ? -x : x;
}

Gabriel Laskar (EPITA) CAAL 2015 357 / 378 CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs()

Let’s go back to some mathematical principles:

I Addition: a + 1 = a - (-1)

I Number representation: -1 = 0xffffffff

I Negation: -a = (not a) + 1

I not a = a xor 0xffffffff = a xor -1

I if (x < 0), return -x = (not x) + 1 = (x ^ 0xffffffff) - 0xffffffff

I if (x > 0), return x = (x ^ 0) + 0

How to produce -1 when x < 0?

I Sign extension: (x » 31)

Gabriel Laskar (EPITA) CAAL 2015 358 / 378 CPU-aware optimizations Examples abs() Nicer with the pipeline abs() better code

int abs(int x)
{
    int sign_word = x >> 31;
    return (x ^ sign_word) - sign_word;
}

int abs(int x)
{
    int sign_word = -(1 & (x >> 31));
    return (x ^ sign_word) - sign_word;
}
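A small hedged sanity check of the branchless versions (assuming 32-bit int; note that right-shifting a negative signed value is implementation-defined in C, which is exactly what the x >> 31 trick relies on, and that abs(INT_MIN) overflows in every version):

#include <assert.h>

static int abs_branchless(int x)
{
    int sign_word = x >> 31;                 /* 0 or 0xffffffff */
    return (x ^ sign_word) - sign_word;
}

int main(void)
{
    for (int x = -1000; x <= 1000; x++)
        assert(abs_branchless(x) == (x < 0 ? -x : x));
    return 0;
}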

Gabriel Laskar (EPITA) CAAL 2015 359 / 378 CPU-aware optimizations Examples Sign extension 10: CPU-aware optimizations

Rationale

Examples abs() Sign extension Power of two detection Mask merging

Code readability guidelines

Gabriel Laskar (EPITA) CAAL 2015 360 / 378 CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size

Sometimes, we need to sign extend a value from an arbitrary word width (say 13 bits) to a CPU word (say 32 bits). How to do it?

x      0000000000000000000snnnnnnnnnnnn
-high  11111111111111111111000000000000
result ssssssssssssssssssssnnnnnnnnnnnn

int sign_ext(int val, int bits)
{
    int high = 1 << (bits - 1);
    return (val & high) ? (val | (-high)) : val;
}

Gabriel Laskar (EPITA) CAAL 2015 361 / 378 CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size

x 0000000000000000000snnnnnnnnnnnn
« snnnnnnnnnnnn0000000000000000000
» ssssssssssssssssssssnnnnnnnnnnnn

int sign_ext(int val, int bits)
{
    int shift_bits = 32 - bits;
    return (val << shift_bits) >> shift_bits;
}
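A small hedged usage example (hypothetical instruction encoding, assuming 32-bit int and an arithmetic right shift): decoding a 13-bit signed immediate from an instruction word with the shift version above:

#include <stdint.h>

static int sign_ext(int val, int bits)
{
    int shift_bits = 32 - bits;
    return (val << shift_bits) >> shift_bits;
}

/* Bits 12..0 hold the immediate, bit 12 is its sign. */
int32_t decode_imm13(uint32_t insn)
{
    return sign_ext((int)(insn & 0x1fff), 13);
}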

Gabriel Laskar (EPITA) CAAL 2015 362 / 378 CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size

x       0000000000000000000snnnnnnnnnnnn
sb      0000000000000000000s000000000000
xor     0000000000000000000Snnnnnnnnnnnn

x       00000000000000000001nnnnnnnnnnnn
sb      00000000000000000001000000000000
x ^ sb  00000000000000000000nnnnnnnnnnnn
.. - sb 11111111111111111111nnnnnnnnnnnn

x       00000000000000000000nnnnnnnnnnnn
sb      00000000000000000001000000000000
x ^ sb  00000000000000000001nnnnnnnnnnnn
.. - sb 00000000000000000000nnnnnnnnnnnn

int sign_ext(int val, int bits)
{
    int high = 1 << (bits - 1);
    return (val ^ high) - high;
}

Gabriel Laskar (EPITA) CAAL 2015 363 / 378 CPU-aware optimizations Examples Power of two detection 10: CPU-aware optimizations

Rationale

Examples abs() Sign extension Power of two detection Mask merging

Code readability guidelines

Gabriel Laskar (EPITA) CAAL 2015 364 / 378 CPU-aware optimizations Examples Power of two detection Power of two Properties

I Powers of two have only one “1” bit

I Subtracting 1 from a power of two flips the only “1”

Examples:

I 1110010 - 1 = 1110001

I 0100000 - 1 = 0011111

I Lower bits up to the lowest 1 get flipped

I Other bits stay the same

int is_pow2(int n)
{
    return !(n & (n - 1));
}
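A small hedged usage sketch: note that is_pow2 as written also reports 0 as a power of two (0 & -1 == 0), so callers validating e.g. an alignment must test for 0 separately (is_valid_alignment is a hypothetical helper):

#include <assert.h>

static int is_pow2(int n)
{
    return !(n & (n - 1));
}

static int is_valid_alignment(int align)
{
    return align > 0 && is_pow2(align);
}

int main(void)
{
    assert(is_pow2(64) && !is_pow2(96) && is_pow2(0));
    assert(is_valid_alignment(16) && !is_valid_alignment(0));
    return 0;
}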

Gabriel Laskar (EPITA) CAAL 2015 365 / 378 CPU-aware optimizations Examples Mask merging 10: CPU-aware optimizations

Rationale

Examples abs() Sign extension Power of two detection Mask merging

Code readability guidelines

Gabriel Laskar (EPITA) CAAL 2015 366 / 378 CPU-aware optimizations Examples Mask merging Mask merging

         0 1 2 3 4 5 6 7
where_0  1 1 1 0 1 0 0 0
where_1  1 0 0 0 1 1 1 0
mask     0 0 1 1 1 1 0 0
result   1 1 0 0 1 1 0 0

int mask_merge(int mask, int where_0, int where_1)
{
    return (mask & where_1) | (~mask & where_0);
}

Gabriel Laskar (EPITA) CAAL 2015 367 / 378 CPU-aware optimizations Examples Mask merging Mask merging Fewer instructions

                          0 1 2 3 4 5 6 7
where_0                   1 1 1 0 1 0 0 0
where_1                   1 0 0 0 1 1 1 0
where_0 ^ where_1         0 1 1 0 0 1 1 0
mask                      0 0 1 1 1 1 0 0
(w0 ^ w1) & mask          0 0 1 0 0 1 0 0
((w0 ^ w1) & mask) ^ w0   1 1 0 0 1 1 0 0

int mask_merge(int mask, int where_0, int where_1)
{
    return ((where_0 ^ where_1) & mask) ^ where_0;
}
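A small hedged usage sketch (hypothetical names): updating only the bits selected by the mask, e.g. the low nibble of a device register value, with the three-operation merge:

#include <stdint.h>

static uint32_t mask_merge(uint32_t mask, uint32_t where_0, uint32_t where_1)
{
    return ((where_0 ^ where_1) & mask) ^ where_0;
}

/* Bits where the mask is 1 come from nibble, the others keep reg's value. */
uint32_t set_low_nibble(uint32_t reg, uint32_t nibble)
{
    return mask_merge(0x0000000fu, reg, nibble);
}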

Gabriel Laskar (EPITA) CAAL 2015 368 / 378 CPU-aware optimizations Code readability guidelines 10: CPU-aware optimizations

Rationale

Examples

Code readability guidelines

Gabriel Laskar (EPITA) CAAL 2015 369 / 378 CPU-aware optimizations Code readability guidelines Code readability

If you ever code with this kind of hack:

I always create functions with an explicit name and prototype

I document the intended behavior

I consider using a static inline

int abs(int x);
int sign_ext(int val, int bits);
int is_pow2(int n);
int mask_merge(int mask, int where_0, int where_1);

Gabriel Laskar (EPITA) CAAL 2015 370 / 378 Multi-/Many-core, heterogeneous systems

Part XI

Multi-/Many-core, heterogeneous systems

Gabriel Laskar (EPITA) CAAL 2015 371 / 378 Multi-/Many-core, heterogeneous systems Topology evolution 11: Multi-/Many-core, heterogeneous systems

Topology evolution

Challenges

Gabriel Laskar (EPITA) CAAL 2015 372 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies

CPU

RAM Video

Keyboard Floppy

Sound Serial

Network Hard Disk

Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies

CPU

RAM Bridge Bridge

Network Video Keyboard

Sound Hard Disk Floppy

Serial

Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies

CPU CPU

RAM Bridge Bridge

Network Video Keyboard

Sound Hard Disk Floppy

Serial

Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies

CPU CPU

RAM Bridge Video

Bridge Bridge

Keyboard Network

Sound Hard Disk Floppy

USB Serial

Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies

CPU CPU

RAM Bridge Video

Bridge PCI ATA Floppy

Net Kbd

USB Mouse USB Sound Serial

Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies

CPU CPU

RAM Bridge Video

Bridge PCI ATA SATA Floppy

Net Wlan Kbd

USB Mouse USB Sound SD Serial

Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies

Many core Many core CPU CPU

Video

RAM Bridge

PCI−e

Bridge PCI ATA SATA Floppy

Net Wlan Kbd

USB Mouse USB Sound SD Serial

Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies

Many core Many core RAM RAM CPU CPU

PCI−e Video

Bridge

PCI−e

Bridge PCI ATASATA Floppy

Net Wlan Kbd

USB Mouse USB Sound SD Serial

Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution Future of hardware topologies?

GPGPU GPGPU Many core Many core CPU CPU RAM RAM

GPGPU GPGPU Many core Many core CPU CPU RAM RAM

GPGPU GPGPU Many core Many core CPU CPU RAM RAM

Bridge PCI ATASATA Floppy Video Net Wlan Kbd

USB Mouse USB Sound SD Serial

Gabriel Laskar (EPITA) CAAL 2015 374 / 378 Multi-/Many-core, heterogeneous systems Topology evolution Hardware topologies Observations

Busses don’t scale

I Bandwidth is shared among connected peers

I They need rest cycles between arbitrations

We don’t have busses any more

I We use networks (QPI, HyperTransport)

I This forbids snooping

I Coherence has to be done explicitly

Gabriel Laskar (EPITA) CAAL 2015 375 / 378 Multi-/Many-core, heterogeneous systems Topology evolution NUMA Observations

I There are more and more cores

I Memory gets closer to the cores

I Memory connections get distributed among cores

I Systems become NUMA

Gabriel Laskar (EPITA) CAAL 2015 376 / 378 Multi-/Many-core, heterogeneous systems Challenges 11: Multi-/Many-core, heterogeneous systems

Topology evolution

Challenges

Gabriel Laskar (EPITA) CAAL 2015 377 / 378 Multi-/Many-core, heterogeneous systems Challenges Some scalability bottlenecks

Coherent shared-memory systems

I are hard to design

I don’t scale well

I get slower with the load

NUMA systems

I are hard to program for

I are not well supported in all OSes

Non-coherent shared-memory systems are not ready for prime time yet

Gabriel Laskar (EPITA) CAAL 2015 378 / 378