Computer Architecture and Assembly Language
Gabriel Laskar
EPITA
2015 License
I Copyright c 2004-2005, ACU, Benoit Perrot
I Copyright c 2004-2008, Alexandre Becoulet
I Copyright c 2009-2013, Nicolas Pouillon
I Copyright c 2014, Joël Porquet
I Copyright c 2015, Gabriel Laskar
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with the Invariant Sections being just ‘‘Copying this document’’, no Front-Cover Texts, and no Back-Cover Texts. Introduction
Part I
Introduction
Gabriel Laskar (EPITA) CAAL 2015 3 / 378 Introduction Problem definition 1: Introduction
Problem definition
Outline
Gabriel Laskar (EPITA) CAAL 2015 4 / 378 Introduction Problem definition What are we trying to learn? Computer Architecture
What is in the hardware?
I A bit of history of computers, current machines
I Concepts and conventions: processing, memory, communication, optimization
How does a machine run code?
I Program execution model
I Memory mapping, OS support
Gabriel Laskar (EPITA) CAAL 2015 5 / 378 Introduction Problem definition What are we trying to learn? Assembly Language
How to “talk” with the machine directly?
I Mechanisms involved
I Assembly language structure and usage
I Low-level assembly language features
I C inline assembly
Gabriel Laskar (EPITA) CAAL 2015 6 / 378 I Programmers
I Wise managers
Introduction Problem definition Who do I talk to?
I System gurus
I Low-level enthusiasts
Gabriel Laskar (EPITA) CAAL 2015 7 / 378 I Wise managers
Introduction Problem definition Who do I talk to?
I System gurus
I Low-level enthusiasts
I Programmers
Gabriel Laskar (EPITA) CAAL 2015 7 / 378 I Wise managers
Introduction Problem definition Who do I talk to?
I System gurus
I Low-level enthusiasts
I Programmers C/C++
Gabriel Laskar (EPITA) CAAL 2015 7 / 378 I Wise managers
Introduction Problem definition Who do I talk to?
I System gurus
I Low-level enthusiasts
I Programmers C/C++, Objective-C
Gabriel Laskar (EPITA) CAAL 2015 7 / 378 I Wise managers
Introduction Problem definition Who do I talk to?
I System gurus
I Low-level enthusiasts
I Programmers C/C++, Objective-C, C#
Gabriel Laskar (EPITA) CAAL 2015 7 / 378 I Wise managers
Introduction Problem definition Who do I talk to?
I System gurus
I Low-level enthusiasts
I Programmers C/C++, Objective-C, C#, Java
Gabriel Laskar (EPITA) CAAL 2015 7 / 378 I Wise managers
Introduction Problem definition Who do I talk to?
I System gurus
I Low-level enthusiasts
I Programmers C/C++, Objective-C, C#, Java, JS/AS
Gabriel Laskar (EPITA) CAAL 2015 7 / 378 I Wise managers
Introduction Problem definition Who do I talk to?
I System gurus
I Low-level enthusiasts
I Programmers at any level: (C/C++, Objective-C, C#, Java, JS/AS, etc.)
Gabriel Laskar (EPITA) CAAL 2015 7 / 378 Introduction Problem definition Who do I talk to?
I System gurus
I Low-level enthusiasts
I Programmers at any level: (C/C++, Objective-C, C#, Java, JS/AS, etc.)
I Wise managers
Gabriel Laskar (EPITA) CAAL 2015 7 / 378 Introduction Outline 1: Introduction
Problem definition
Outline
Gabriel Laskar (EPITA) CAAL 2015 8 / 378 Introduction Outline Course outline
I Processor architecture
I Memory
I Memory mapping
I Execution flow
I Object file formats
I Assembly programming
I Focus on x86
I Focus on RISC processors
I CPU-aware optimizations
I Multi-/Many-core, heterogeneous systems
Gabriel Laskar (EPITA) CAAL 2015 9 / 378 Processor architecture
Part II
Processor architecture
Gabriel Laskar (EPITA) CAAL 2015 10 / 378 Processor architecture Overview 2: Processor architecture
Overview
Inside the processor
Processor units
Instructions
Instruction flow
Pipeline processor
CISC & RISC architectures
Gabriel Laskar (EPITA) CAAL 2015 11 / 378 Let’s design it!
Processor architecture Overview What a processor is...
A processor must be able to perform the following basic tasks:
I Execute instructions
I Read operands
I Store results
It needs several basic units to perform those tasks:
I A control unit
I An arithmetic and logical unit (ALU)
I A register bank
Gabriel Laskar (EPITA) CAAL 2015 12 / 378 Processor architecture Overview What a processor is...
A processor must be able to perform the following basic tasks:
I Execute instructions
I Read operands
I Store results
It needs several basic units to perform those tasks:
I A control unit
I An arithmetic and logical unit (ALU)
I A register bank
Let’s design it!
Gabriel Laskar (EPITA) CAAL 2015 12 / 378 Processor architecture Overview Basic architecture
R0 R1 R2 = 0123 R3 = 0100 R4 = 0023
R5 Registers
ALU Control Unit
Gabriel Laskar (EPITA) CAAL 2015 13 / 378 Unfortunately,
I Internal memory is expensive and hard to design
I There is no communication
I Updating the program may not be easy
We need an access to memory, external devices, etc.
Processor architecture Overview Basic architecture (2)
In this model, the system state is entirely contained in the processor.
I This might be sufficient for a very basic processor
I More features could be leveraged by adding registers or program steps
Gabriel Laskar (EPITA) CAAL 2015 14 / 378 We need an access to memory, external devices, etc.
Processor architecture Overview Basic architecture (2)
In this model, the system state is entirely contained in the processor.
I This might be sufficient for a very basic processor
I More features could be leveraged by adding registers or program steps
Unfortunately,
I Internal memory is expensive and hard to design
I There is no communication
I Updating the program may not be easy
Gabriel Laskar (EPITA) CAAL 2015 14 / 378 Processor architecture Overview Basic architecture (2)
In this model, the system state is entirely contained in the processor.
I This might be sufficient for a very basic processor
I More features could be leveraged by adding registers or program steps
Unfortunately,
I Internal memory is expensive and hard to design
I There is no communication
I Updating the program may not be easy
We need an access to memory, external devices, etc.
Gabriel Laskar (EPITA) CAAL 2015 14 / 378 It needs several basic units to perform those tasks:
I A control unit
I An arithmetic and logical unit (ALU)
I A register bank
I A program memory access
I A data memory access
Processor architecture Overview Revised processor model
A processor must be able to perform the following basic tasks:
I Fetch instructions from an external entity and understand them (fetch and decode)
I Execute instructions
I Store results to registers or external memory
Gabriel Laskar (EPITA) CAAL 2015 15 / 378 Processor architecture Overview Revised processor model
A processor must be able to perform the following basic tasks:
I Fetch instructions from an external entity and understand them (fetch and decode)
I Execute instructions
I Store results to registers or external memory
It needs several basic units to perform those tasks:
I A control unit
I An arithmetic and logical unit (ALU)
I A register bank
I A program memory access
I A data memory access
Gabriel Laskar (EPITA) CAAL 2015 15 / 378 Processor architecture Overview Revised processor model (2)
R0 R1 R2 = 0123 R3 = 0100 R4 = 0023
R5 Registers
ALU Control Unit Instruction @ Code Command R Data Data @ W Data
Gabriel Laskar (EPITA) CAAL 2015 16 / 378 Processor architecture Overview Processor physical layout
A processor is composed of many different units:
I Caches, MMU
I Integer unit, Control unit, Floating-point unit Each unit is:
I implemented as an hardware component
I made of switchable parts (transistors)
In old processors:
I Units used to be independent chips
I Some were even optional “coprocessors” Today, processors are embedded on a single chip.
Gabriel Laskar (EPITA) CAAL 2015 17 / 378 Processor architecture Inside the processor 2: Processor architecture
Overview
Inside the processor
Processor units
Instructions
Instruction flow
Pipeline processor
CISC & RISC architectures
Gabriel Laskar (EPITA) CAAL 2015 18 / 378 Processor architecture Inside the processor Processor package
Gabriel Laskar (EPITA) CAAL 2015 19 / 378 Processor architecture Inside the processor Processor physical layout
Gabriel Laskar (EPITA) CAAL 2015 20 / 378 Processor architecture Inside the processor Processor physical layout
Gabriel Laskar (EPITA) CAAL 2015 21 / 378 Processor architecture Inside the processor Processor physical layout Transistor details
Gabriel Laskar (EPITA) CAAL 2015 22 / 378 Processor architecture Processor units 2: Processor architecture
Overview
Inside the processor
Processor units
Instructions
Instruction flow
Pipeline processor
CISC & RISC architectures
Gabriel Laskar (EPITA) CAAL 2015 23 / 378 Processor architecture Processor units Units
R0 R1 R2 = 0123 R3 = 0100 I Control unit fetches and R4 = 0023
decodes the instruction R5 Registers
I Registers gives the data
I ALU implements the
operation ALU I Some instructions access Control Unit external data Instruction @ Code Command R Data Data @ W Data
Gabriel Laskar (EPITA) CAAL 2015 24 / 378 Processor architecture Processor units Registers May be seen as variables located inside the processor
I 8, 16, 32, 64, 128, ... -bits large I General-purpose registers:
I integer (int) I floating point (float, double) I Specialized registers:
I flags
I Zero, I Negative, I Carry, I Overflow, I etc.
I system
I Mode, I IRQ masking, I etc.
Gabriel Laskar (EPITA) CAAL 2015 25 / 378 Processor architecture Processor units ALU: Arithmetic and Logical Unit An unit without registers!
I Logical operations
I AND, OR, XOR, NOT, NOR I Arithmetic operations
I addition, subtraction, multiplication, division
I Shifts
I Compares
Division is not possible without registers!
Gabriel Laskar (EPITA) CAAL 2015 26 / 378 Processor architecture Instructions 2: Processor architecture
Overview
Inside the processor
Processor units
Instructions
Instruction flow
Pipeline processor
CISC & RISC architectures
Gabriel Laskar (EPITA) CAAL 2015 27 / 378 Processor architecture Instructions Instruction types
There are different instruction types:
I Arithmetic and logical operations
I Control instructions
I Memory access instructions
Gabriel Laskar (EPITA) CAAL 2015 28 / 378 Processor architecture Instructions Instructions classification
Flynn’s taxonomy:
I SISD: Single Instruction, Single Data Classical Von Neumann architecture
I SIMD: Single Instruction, Multiple Data Vectorial computers
I MIMD: Multiple Instruction, Multiple Data Multiprocessor computers
Gabriel Laskar (EPITA) CAAL 2015 29 / 378 Processor architecture Instruction flow 2: Processor architecture
Overview
Inside the processor
Processor units
Instructions
Instruction flow
Pipeline processor
CISC & RISC architectures
Gabriel Laskar (EPITA) CAAL 2015 30 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch
jmp
ld PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod
jmp
ld PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod Exec
jmp
ld PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod Exec Write
jmp
ld PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod Exec Write
jmp Fetch
ld PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod Exec Write
jmp Fetch Decod
ld PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod Exec Write
jmp Fetch Decod Exec
ld PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod Exec Write
jmp Fetch Decod Exec
ld Fetch PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod Exec Write
jmp Fetch Decod Exec
ld Fetch Decod PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod Exec Write
jmp Fetch Decod Exec
ld Fetch Decod Exec PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod Exec Write
jmp Fetch Decod Exec
ld Fetch Decod Exec Mem PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Processor architecture Instruction flow Microprogrammed processor Instruction flow
Processors may use a finite state machine or micro ops to process each instruction step (Fetch, Decode, Execute, Memory access, Register write back, etc.)
A basic processor needs several clock cycles to process all steps of the current instruction before starting the next one.
sub Fetch Decod Exec Write
jmp Fetch Decod Exec
ld Fetch Decod Exec Mem Write PC address
Gabriel Laskar (EPITA) CAAL 2015 31 / 378 Pros:
I Easy to implement
I Small
I New instructions can be added just by adding new steps
Processor architecture Instruction flow Microprogrammed processor Pro/cons
Cons:
I Slow
I The more complex the instructions are, the longer they take to get processed
I Most of the hardware is used only once for each instruction
I Most of the hardware is unused most of the time
Gabriel Laskar (EPITA) CAAL 2015 32 / 378 Processor architecture Instruction flow Microprogrammed processor Pro/cons
Cons:
I Slow
I The more complex the instructions are, the longer they take to get processed
I Most of the hardware is used only once for each instruction
I Most of the hardware is unused most of the time
Pros:
I Easy to implement
I Small
I New instructions can be added just by adding new steps
Gabriel Laskar (EPITA) CAAL 2015 32 / 378 Processor architecture Pipeline processor 2: Processor architecture
Overview
Inside the processor
Processor units
Instructions
Instruction flow
Pipeline processor
CISC & RISC architectures
Gabriel Laskar (EPITA) CAAL 2015 33 / 378 Processor architecture Pipeline processor Pipelined processor Instruction flow A pipeline architecture enables the parallel execution of several instructions:
I Split the execution of each instruction in several steps I Each step performs an elementary operation I Each step is associated to a specific part of the hardware I All parts of the hardware work in parallel
Fetch Decode Exec Memory Writeback
Instruction Data memory memory access access
Gabriel Laskar (EPITA) CAAL 2015 34 / 378 Processor architecture Pipeline processor Pipeline Flow
? Fetch Instruction sequence
Pipeline timeline
Gabriel Laskar (EPITA) CAAL 2015 35 / 378 Processor architecture Pipeline processor Pipeline Flow
sub Fetch Decod
? Fetch Instruction sequence
Pipeline timeline
Gabriel Laskar (EPITA) CAAL 2015 35 / 378 Processor architecture Pipeline processor Pipeline Flow
sub Fetch Decod Exec
not Fetch Decod
? Fetch Instruction sequence
Pipeline timeline
Gabriel Laskar (EPITA) CAAL 2015 35 / 378 Processor architecture Pipeline processor Pipeline Flow
sub Fetch Decod Exec Mem
not Fetch Decod Exec
add Fetch Decod
? Fetch Instruction sequence
Pipeline timeline
Gabriel Laskar (EPITA) CAAL 2015 35 / 378 Processor architecture Pipeline processor Pipeline Flow
sub Fetch Decod Exec Mem Write
not Fetch Decod Exec Mem
add Fetch Decod Exec
jmp Fetch Decod Instruction sequence ? Fetch
Pipeline timeline
Gabriel Laskar (EPITA) CAAL 2015 35 / 378 Processor architecture Pipeline processor Pipeline Flow
sub Fetch Decod Exec Mem Write
not Fetch Decod Exec Mem Write
add Fetch Decod Exec Mem
jmp Fetch Decod Exec Instruction sequence sub Fetch Decod
? Fetch
Pipeline timeline
Gabriel Laskar (EPITA) CAAL 2015 35 / 378 Processor architecture Pipeline processor Pipeline Flow
sub Fetch Decod Exec Mem Write
not Fetch Decod Exec Mem Write
add Fetch Decod Exec Mem Write
jmp Fetch Decod Exec Mem Instruction sequence sub Fetch Decod Exec
add Fetch Decod
? Fetch
Pipeline timeline
Gabriel Laskar (EPITA) CAAL 2015 35 / 378 Processor architecture Pipeline processor Speeding up the pipeline
Current processors extensively use the pipeline architecture to accelerate the execution of instructions.
Because a pipeline architecture works in parallel, the slowest step delay determines the pipeline global cycle delay (and working frequency).
Step 1 Step 2 Step 3
10 ns 18 ns 8 ns
Gabriel Laskar (EPITA) CAAL 2015 36 / 378 Processor architecture Pipeline processor Speeding up the pipeline Splitting operations in shorter steps enables the processor frequency to increase.
Step 1 Step 2 Step 3
10 ns 18 ns 8 ns
Step 1 Step 2.1 Step 2.2 Step 3
10 ns 9 ns 9 ns 8 ns
Gabriel Laskar (EPITA) CAAL 2015 37 / 378 Processor architecture CISC & RISC architectures 2: Processor architecture
Overview
Inside the processor
Processor units
Instructions
Instruction flow
Pipeline processor
CISC & RISC architectures
Gabriel Laskar (EPITA) CAAL 2015 38 / 378 Processor architecture CISC & RISC architectures Instruction-set classification
Based on internal architecture and instructions formats, processor architectures may be classified in two groups:
I Complex Instruction Set Computer (CISC)
I Reduced Instruction Set Computer (RISC)
Early processor architectures were mostly CISC-based: z80, Intel x86, Motorola 68000, etc.
More recent designs are rather RISC-based: MIPS, Sparc, Alpha, PowerPC, ARM, etc.
Gabriel Laskar (EPITA) CAAL 2015 39 / 378 Processor architecture CISC & RISC architectures RISC Characteristics
Pros:
I Simple instructions
I Fixed-length instructions
I Decoding instructions requires simple hardware
Cons:
I Programs are longer as they need more instructions
I Optimization is harder, compilers need to be smarter
Sometimes said as “RejectImportantStuff intoCompiler”
Gabriel Laskar (EPITA) CAAL 2015 40 / 378 Processor architecture CISC & RISC architectures RISC Instruction example sub %g1, %g2, %g3
0x860x20 0x40 0x02
format destination opcode source unused source
10 00011 000100 00001 000000000 00010
%g3 sub %g1 %g2
Gabriel Laskar (EPITA) CAAL 2015 41 / 378 Processor architecture CISC & RISC architectures CISC Characteristics Pros:
I Lots of instructions and opcodes
I A single instruction can perform complex operations
I Assembly programs are easier and shorter to write
I Code compression ratio is good
Cons:
I Binary instruction format has variable length
I It requires more complex hardware and high frequencies are harder to achieve
Modern processors often internally translate the CISC code to RISC microcode
Gabriel Laskar (EPITA) CAAL 2015 42 / 378 Processor architecture CISC & RISC architectures CISC Instruction example
opcode opc2 mod immediate memory r/m displacement
Gabriel Laskar (EPITA) CAAL 2015 43 / 378 Memory
Part III
Memory
Gabriel Laskar (EPITA) CAAL 2015 44 / 378 Memory Memories 3: Memory
Memories Memory types Access examples
Memory accessing modes
Alignment
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 45 / 378 Memory Memories Memory types 3: Memory
Memories Memory types Access examples
Memory accessing modes
Alignment
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 46 / 378 Memory Memories Memory types Reasons to access memory
How does the memory work with the processor?
I Memory is used to fetch instructions,
I Memory is used to access data. I There may be one unique memory, or two.
I If there is one memory, instruction / data accesses must be sequencialized I If there are two, code cannot be accessed as data Conventional names: 1 Von Neuman architecture 2 Harvard architecture
Gabriel Laskar (EPITA) CAAL 2015 47 / 378 Memory Memories Access examples 3: Memory
Memories Memory types Access examples
Memory accessing modes
Alignment
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 48 / 378 Memory Memories Access examples Instruction fetch The processor needs an instruction to process
Processor
Memory
Gabriel Laskar (EPITA) CAAL 2015 49 / 378 Memory Memories Access examples Instruction fetch The processor needs an instruction to process
Processor Instruction @
Memory
Gabriel Laskar (EPITA) CAAL 2015 49 / 378 Memory Memories Access examples Instruction fetch The processor needs an instruction to process
Processor Instruction @ Code
Memory
Gabriel Laskar (EPITA) CAAL 2015 49 / 378 Memory Memories Access examples Data access load Load the content of the memory cells pointed to by %g1 into %g2 register.
Processor
Memory
ld[%g1], %g2
Gabriel Laskar (EPITA) CAAL 2015 50 / 378 Memory Memories Access examples Data access load Load the content of the memory cells pointed to by %g1 into %g2 register.
Processor Command Data @
Memory
ld[%g1], %g2
Gabriel Laskar (EPITA) CAAL 2015 50 / 378 Memory Memories Access examples Data access load Load the content of the memory cells pointed to by %g1 into %g2 register.
Processor Command R Data Data @
Memory
ld[%g1], %g2
Gabriel Laskar (EPITA) CAAL 2015 50 / 378 Memory Memories Access examples Data access store Stores the content of %g2 register into the memory cells pointed to by %g1.
Processor
Memory
st %g2,[%g1]
Gabriel Laskar (EPITA) CAAL 2015 51 / 378 Memory Memories Access examples Data access store Stores the content of %g2 register into the memory cells pointed to by %g1.
Processor Command Data @ W Data
Memory
st %g2,[%g1]
Gabriel Laskar (EPITA) CAAL 2015 51 / 378 Memory Memory accessing modes 3: Memory
Memories
Memory accessing modes Immediate addressing Absolute addressing Register indirect addressing Complex addressing
Alignment
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 52 / 378 Memory Memory accessing modes Immediate addressing 3: Memory
Memories
Memory accessing modes Immediate addressing Absolute addressing Register indirect addressing Complex addressing
Alignment
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 53 / 378 In C language: int a, b= ...;
a=b+ 0x831; In assembly language: add %g1, 0x831, %g2
Memory Memory accessing modes Immediate addressing Immediate addressing
I The value of data is directly stored in the instruction
I No memory access needed to get the value
Gabriel Laskar (EPITA) CAAL 2015 54 / 378 In assembly language: add %g1, 0x831, %g2
Memory Memory accessing modes Immediate addressing Immediate addressing
I The value of data is directly stored in the instruction
I No memory access needed to get the value In C language: int a, b= ...;
a=b+ 0x831;
Gabriel Laskar (EPITA) CAAL 2015 54 / 378 Memory Memory accessing modes Immediate addressing Immediate addressing
I The value of data is directly stored in the instruction
I No memory access needed to get the value In C language: int a, b= ...;
a=b+ 0x831; In assembly language: add %g1, 0x831, %g2
Gabriel Laskar (EPITA) CAAL 2015 54 / 378 Memory Memory accessing modes Immediate addressing Immediate addressing Sparc instruction details
sub %g1, 0x831, %g2
0x84 0x20 0x68 0x31
format destination opcode source immediate
10 00010 000100 00001 1 0 1000 0011 0001
%g2 sub %g1 0x831
Gabriel Laskar (EPITA) CAAL 2015 55 / 378 Memory Memory accessing modes Absolute addressing 3: Memory
Memories
Memory accessing modes Immediate addressing Absolute addressing Register indirect addressing Complex addressing
Alignment
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 56 / 378 In C language: int a=*( int*)0x830; In assembly language: ld[0x830], %g1
Memory Memory accessing modes Absolute addressing Absolute addressing
I The address of the data is stored in the instruction
I A memory access is needed to get the value
Gabriel Laskar (EPITA) CAAL 2015 57 / 378 In assembly language: ld[0x830], %g1
Memory Memory accessing modes Absolute addressing Absolute addressing
I The address of the data is stored in the instruction
I A memory access is needed to get the value In C language: int a=*( int*)0x830;
Gabriel Laskar (EPITA) CAAL 2015 57 / 378 Memory Memory accessing modes Absolute addressing Absolute addressing
I The address of the data is stored in the instruction
I A memory access is needed to get the value In C language: int a=*( int*)0x830; In assembly language: ld[0x830], %g1
Gabriel Laskar (EPITA) CAAL 2015 57 / 378 Memory Memory accessing modes Register indirect addressing 3: Memory
Memories
Memory accessing modes Immediate addressing Absolute addressing Register indirect addressing Complex addressing
Alignment
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 58 / 378 In C language: int a,*b= ...;
a=*b; In assembly language: ld[%g2], %g1
Memory Memory accessing modes Register indirect addressing Register indirect addressing
I The address of the data is stored in a register
I A memory access is needed to get the value
Gabriel Laskar (EPITA) CAAL 2015 59 / 378 In assembly language: ld[%g2], %g1
Memory Memory accessing modes Register indirect addressing Register indirect addressing
I The address of the data is stored in a register
I A memory access is needed to get the value In C language: int a,*b= ...;
a=*b;
Gabriel Laskar (EPITA) CAAL 2015 59 / 378 Memory Memory accessing modes Register indirect addressing Register indirect addressing
I The address of the data is stored in a register
I A memory access is needed to get the value In C language: int a,*b= ...;
a=*b; In assembly language: ld[%g2], %g1
Gabriel Laskar (EPITA) CAAL 2015 59 / 378 Memory Memory accessing modes Complex addressing 3: Memory
Memories
Memory accessing modes Immediate addressing Absolute addressing Register indirect addressing Complex addressing
Alignment
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 60 / 378 Memory Memory accessing modes Complex addressing Complex addressing
I Register indirect with base register
I Register indirect with offset
I And many others... Assembly example: ld[%g2+ 0x124], %g1 ld[%g2+ %g3], %g1
Gabriel Laskar (EPITA) CAAL 2015 61 / 378 Memory Alignment 3: Memory
Memories
Memory accessing modes
Alignment Memory access alignment Structure alignment packed structures
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 62 / 378 Memory Alignment Memory access alignment 3: Memory
Memories
Memory accessing modes
Alignment Memory access alignment Structure alignment packed structures
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 63 / 378 Memory Alignment Memory access alignment Definition
Data access alignment is solely about considered data type width.
I 32-bit integer access is aligned for addresses multiple of 4
I 16-bit integer access is aligned for even addresses
I 8-bit integer (char) accesses are always aligned! Think about address % sizeof(type)
Gabriel Laskar (EPITA) CAAL 2015 64 / 378 Memory Alignment Memory access alignment Access alignment Bus Width
0x00 Address 0x01 0x34 0x56 0x78 0x04
0x08 0x89 0xab 0x0c 0xcd0xef 0x9f 0x8c 0x10 0x5a 0x6b
Gabriel Laskar (EPITA) CAAL Data read 2015 65 / 378 Memory Alignment Structure alignment 3: Memory
Memories
Memory accessing modes
Alignment Memory access alignment Structure alignment packed structures
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 66 / 378 Memory Alignment Structure alignment Structure alignment
In a structure:
I Fields must be in declaration order
I Fields must all be aligned Data alignment does not depend on architecture bus width
struct bit_packed_s { int a; 015 16 32 short b; a short c; }; b c
Gabriel Laskar (EPITA) CAAL 2015 67 / 378 I Compilers put structure fields at aligned offsets I Alignment may add unused padding bytes between fields struct example_aligned_s { char a; 015 16 32 int b; a short c; }; b
c
padding bytes
Memory Alignment Structure alignment Padding Sometimes, consecutive fields in declaration cannot be consecutive and aligned in memory
Gabriel Laskar (EPITA) CAAL 2015 68 / 378 Memory Alignment Structure alignment Padding Sometimes, consecutive fields in declaration cannot be consecutive and aligned in memory
I Compilers put structure fields at aligned offsets I Alignment may add unused padding bytes between fields struct example_aligned_s { char a; 015 16 32 int b; a short c; }; b
c
padding bytes Gabriel Laskar (EPITA) CAAL 2015 68 / 378 Memory Alignment packed structures 3: Memory
Memories
Memory accessing modes
Alignment Memory access alignment Structure alignment packed structures
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 69 / 378 Memory Alignment packed structures Basic packing
I Fields alignment can be ignored by compiler, on request
I Few architectures are able to access non aligned fields directly
I If non-native, unaligned access is emulated with multiple memory accesses, shifts, ORs, etc.
struct example_packed_s { 015 16 32 char a; a b int b; short c; c } __attribute__((packed));
Gabriel Laskar (EPITA) CAAL 2015 70 / 378 Memory Alignment packed structures Low-level packing
I Packing can even be done at bit level! I Compiler will handle shifts and masks I Can be mixed with union I Powerful for matching existing protocols struct bit_packed_s { int a:17; 015 16 32 int b:5; a b c int c:12; int d:16; d e int e:14; } __attribute__((packed));
Beware of endianness Gabriel Laskar (EPITA) CAAL 2015 71 / 378 Memory Endianness 3: Memory
Memories
Memory accessing modes
Alignment
Endianness Different designs, different paradigms Explaination Demo
Caches
Gabriel Laskar (EPITA) CAAL 2015 72 / 378 Memory Endianness Different designs, different paradigms 3: Memory
Memories
Memory accessing modes
Alignment
Endianness Different designs, different paradigms Explaination Demo
Caches
Gabriel Laskar (EPITA) CAAL 2015 73 / 378 Memory Endianness Different designs, different paradigms Endianness
A data string represented with multiple bytes must be stored in memory. Similarly to written language, these bytes may be written “left-to-right” or “right-to-left”.
I Big-Endian
I Little-Endian
I Other endian modes
Gabriel Laskar (EPITA) CAAL 2015 74 / 378 Memory Endianness Explaination 3: Memory
Memories
Memory accessing modes
Alignment
Endianness Different designs, different paradigms Explaination Demo
Caches
Gabriel Laskar (EPITA) CAAL 2015 75 / 378 N = 4810310 = bbe716 4 3 2 1 0 I With b = 10: N = 4 · 10 + 8 · 10 + 1 · 10 + 0 · 10 + 3 · 10 3 2 1 0 I With b = 16: N = 11 · 16 + 11 · 16 + 14 · 16 + 7 · 16
So logically we tend to count digits from LSB: right to left Digit no 4 3 2 1 0 Base 10 value 4 8 1 0 3 Base 16 value b b e 7
Memory Endianness Explaination Endianness Mathematical reference
With a base b, a natural N may be decomposed in digits dk . If we naturally write it: N = dndn−1...dk ...d1d0 n n−1 k 1 0 N = dn · b + dn−1 · b + ··· + dk · b + ··· + d1 · b + d0 · b
Gabriel Laskar (EPITA) CAAL 2015 76 / 378 So logically we tend to count digits from LSB: right to left Digit no 4 3 2 1 0 Base 10 value 4 8 1 0 3 Base 16 value b b e 7
Memory Endianness Explaination Endianness Mathematical reference
With a base b, a natural N may be decomposed in digits dk . If we naturally write it: N = dndn−1...dk ...d1d0 n n−1 k 1 0 N = dn · b + dn−1 · b + ··· + dk · b + ··· + d1 · b + d0 · b
N = 4810310 = bbe716 4 3 2 1 0 I With b = 10: N = 4 · 10 + 8 · 10 + 1 · 10 + 0 · 10 + 3 · 10 3 2 1 0 I With b = 16: N = 11 · 16 + 11 · 16 + 14 · 16 + 7 · 16
Gabriel Laskar (EPITA) CAAL 2015 76 / 378 Memory Endianness Explaination Endianness Mathematical reference
With a base b, a natural N may be decomposed in digits dk . If we naturally write it: N = dndn−1...dk ...d1d0 n n−1 k 1 0 N = dn · b + dn−1 · b + ··· + dk · b + ··· + d1 · b + d0 · b
N = 4810310 = bbe716 4 3 2 1 0 I With b = 10: N = 4 · 10 + 8 · 10 + 1 · 10 + 0 · 10 + 3 · 10 3 2 1 0 I With b = 16: N = 11 · 16 + 11 · 16 + 14 · 16 + 7 · 16
So logically we tend to count digits from LSB: right to left Digit no 4 3 2 1 0 Base 10 value 4 8 1 0 3 Base 16 value b b e 7 Gabriel Laskar (EPITA) CAAL 2015 76 / 378 Memory Endianness Explaination Memory representation
Usually, we like to represent memory in written order, the same way we write words on a paper sheet: left to right
0 1 2 3 0x00 ’A’ ’’ ’s’ ’i’ 0x04 ’m’ ’p’ ’l’ ’e’ 0x08 ’’ ’m’ ’e’ ’s’ 0x0c ’s’ ’a’ ’g’ ’e’
Gabriel Laskar (EPITA) CAAL 2015 77 / 378 Memory Endianness Explaination Integer memory representation The “endianness” problem is whether to write integers I in text order: the natural way (for western languages) I in index order: with digit 0 at address 0
3 2 1 0
0x00 0001 02 03 0x04 1011 12 13 0x08 20 21 22 23 0x0c 30 31 32 33 0x10 40 41 42 43
Gabriel Laskar (EPITA) CAAL 2015 78 / 378 Memory Endianness Explaination Integer memory representation The “endianness” problem is whether to write integers I in text order: the natural way (for western languages) I in index order: with digit 0 at address 0
3 2 1 0 20 21 22 23 Big endian
0x00 0001 02 03 0x04 1011 12 13 0x08 20 21 22 23 0x0c 30 31 32 33 0x10 40 41 42 43
Gabriel Laskar (EPITA) CAAL 2015 78 / 378 Memory Endianness Explaination Integer memory representation The “endianness” problem is whether to write integers I in text order: the natural way (for western languages) I in index order: with digit 0 at address 0
3 2 1 0
23 22 21 20 Little endian
0x00 0001 02 03 0x04 1011 12 13 0x08 20 21 22 23 0x0c 30 31 32 33 0x10 40 41 42 43
Gabriel Laskar (EPITA) CAAL 2015 78 / 378 Memory Endianness Explaination Integer memory representation The “endianness” problem is whether to write integers I in text order: the natural way (for western languages) I in index order: with digit 0 at address 0
3 2 1 0
23 22 21 20 Little endian
0x00 0x00 03 02 01 00 0001 02 03 0x04 0x04 13 12 11 10 1011 12 13 0x08 0x08 23 22 21 20 20 21 22 23 0x0c 0x0c 33 32 31 30 30 31 32 33 0x10 0x10 43 42 41 40 40 41 42 43
Gabriel Laskar (EPITA) CAAL 2015 78 / 378 Memory Endianness Explaination Integer memory representation The “endianness” problem is whether to write integers I in text order: the natural way (for western languages) I in index order: with digit 0 at address 0
3 2 1 0
Little endian 23 22 21 20
0x00 0x00 03 02 01 00 0001 02 03 0x04 0x04 13 12 11 10 1011 12 13 0x08 0x08 23 22 21 20 20 21 22 23 0x0c 0x0c 33 32 31 30 30 31 32 33 0x10 0x10 43 42 41 40 40 41 42 43
Gabriel Laskar (EPITA) CAAL 2015 78 / 378 Memory Endianness Explaination Endianness Do not mix everything!
I Data must be stored and fetched using the same convention.
I Don’t worry about byte order in registers, it does not make sense.
Gabriel Laskar (EPITA) CAAL 2015 79 / 378 Memory Endianness Demo 3: Memory
Memories
Memory accessing modes
Alignment
Endianness Different designs, different paradigms Explaination Demo
Caches
Gabriel Laskar (EPITA) CAAL 2015 80 / 378 Memory Endianness Demo Endianness Demo
int main(void) { unsigned int a= 0x12345678; hexdump(&a); }
Gabriel Laskar (EPITA) CAAL 2015 81 / 378 Memory Caches 3: Memory
Memories
Memory accessing modes
Alignment
Endianness
Caches
Gabriel Laskar (EPITA) CAAL 2015 82 / 378 Memory Caches Cache memory: reason
I Once upon a time, CPU and memory speed were the same. I Of course, they evolved:
I Moore’s law: CPU power doubles every 2 years, I Memory speed: +7% every year.
Gabriel Laskar (EPITA) CAAL 2015 83 / 378 Memory Caches Cache memory: definition
Cache memory is:
I a local copy of central memory,
I transparent, on-demand,
I volatile (may be flushed anytime),
I faster, closer to CPU than central memory,
I expensive!
Gabriel Laskar (EPITA) CAAL 2015 84 / 378 Memory Caches Cache memory Hierarchy
CPU CPU
L1 L1 L1 L1 Instr. Data Instr. Data
CPU−Shared L2 CPU−Shared L2 Ins + Data Ins + Data
Die−Shared L3 Ins + Data + TLB
I There are multiple caches “levels” I with different size, I with different speed, I with different latency. I Caches may be shared between code and data. Gabriel Laskar (EPITA) CAAL 2015 85 / 378 Memory Caches Cache memories side-effects
Caches may also involve stangeness:
I having copies of memory introduce a coherence problem,
I an access to a given memory location may vary,
There are various memory operators: Programmer that you handle in C code, Assembler that the compiler generated, CPU that the CPU does, In-cache that the cache does in the system.
Gabriel Laskar (EPITA) CAAL 2015 86 / 378 Memory mapping
Part IV
Memory mapping
Gabriel Laskar (EPITA) CAAL 2015 87 / 378 Memory mapping Address space 4: Memory mapping
Address space Definition Translation
Computer address spaces
Computer address space translation
MMU patterns
Gabriel Laskar (EPITA) CAAL 2015 88 / 378 Memory mapping Address space Definition 4: Memory mapping
Address space Definition Translation
Computer address spaces
Computer address space translation
MMU patterns
Gabriel Laskar (EPITA) CAAL 2015 89 / 378 Memory mapping Address space Definition Address space Definition
An address space is a set of discrete values targetting a set of objects.
I Each address points to one object
I One given object may be pointed by more than one address 1 foo 2 bar 3 baz
Gabriel Laskar (EPITA) CAAL 2015 90 / 378 Memory mapping Address space Translation 4: Memory mapping
Address space Definition Translation
Computer address spaces
Computer address space translation
MMU patterns
Gabriel Laskar (EPITA) CAAL 2015 91 / 378 Memory mapping Address space Translation Address space translation
Address space translation creates a new address space where each source address is mapped to a destination address. A given object can then be targetted by either addresses. a 1 foo b 2 bar c 3 baz
Gabriel Laskar (EPITA) CAAL 2015 92 / 378 Memory mapping Computer address spaces 4: Memory mapping
Address space
Computer address spaces Definition Usual address spaces
Computer address space translation
MMU patterns
Gabriel Laskar (EPITA) CAAL 2015 93 / 378 Memory mapping Computer address spaces Definition 4: Memory mapping
Address space
Computer address spaces Definition Usual address spaces
Computer address space translation
MMU patterns
Gabriel Laskar (EPITA) CAAL 2015 94 / 378 Memory mapping Computer address spaces Definition Memory address space Definition
A memory address space
I is contiguous
I can be mapped to another target address space (through address space translation) All memory accesses are done with respect to an address space.
Gabriel Laskar (EPITA) CAAL 2015 95 / 378 Memory mapping Computer address spaces Usual address spaces 4: Memory mapping
Address space
Computer address spaces Definition Usual address spaces
Computer address space translation
MMU patterns
Gabriel Laskar (EPITA) CAAL 2015 96 / 378 Memory mapping Computer address spaces Usual address spaces Computer address spaces
CPU Address space available through a CPU register. Current machines have 32 or 64 bit pointer registers Physical Address space actually wired between hardware components. Current machines have physical address spaces around 40 bits. Virtual Address space reachable by a process. Generally 32 or 64 bits.
Gabriel Laskar (EPITA) CAAL 2015 97 / 378 Memory mapping Computer address spaces Usual address spaces Physical memory
Physical memory provides the lowest accessible address space in computer. Physical RAM is usually accessible as a small memory subset of the physical address space. Physical memory can be mapped in different ways depending on physical address bus implementation:
I Accessible at a given location,
I Scattered at multiple locations,
I Accessible (many times) at multiple locations.
Gabriel Laskar (EPITA) CAAL 2015 98 / 378 Memory mapping Computer address space translation 4: Memory mapping
Address space
Computer address spaces
Computer address space translation Segmentation Pagination
MMU patterns
Gabriel Laskar (EPITA) CAAL 2015 99 / 378 Memory mapping Computer address space translation Segmentation 4: Memory mapping
Address space
Computer address spaces
Computer address space translation Segmentation Pagination
MMU patterns
Gabriel Laskar (EPITA) CAAL 2015 100 / 378 Memory mapping Computer address space translation Segmentation Memory segments
Memory segments define an address space as a sub-region of another address space. It is mostly defined by the following attributes
I Segment base address in target address space,
I Segment size,
I Segment type and access rights.
Gabriel Laskar (EPITA) CAAL 2015 101 / 378 Memory mapping Computer address space translation Segmentation Simple segment
Segmentation keeps memory locations contiguous.
0
Source address space
0
Destination address space
Gabriel Laskar (EPITA) CAAL 2015 102 / 378 Memory mapping Computer address space translation Segmentation Physical address space
0 Physical address space
0
Physical memory
Gabriel Laskar (EPITA) CAAL 2015 103 / 378 Memory mapping Computer address space translation Segmentation Single mapped RAM
0 Physical address space
addr[x 0 Physical memory Gabriel Laskar (EPITA) CAAL 2015 104 / 378 Memory mapping Computer address space translation Segmentation Single mapped RAM 0 Physical address space addr[x>=n] = ? 0 n Physical memory ? Gabriel Laskar (EPITA) CAAL 2015 104 / 378 Memory mapping Computer address space translation Segmentation Multiple/loop mapped RAM 0 Physical address space addr[x] = mem[x%n] 0 Physical memory Gabriel Laskar (EPITA) CAAL 2015 105 / 378 Memory mapping Computer address space translation Segmentation Memory access using segments To access memory through a defined segment, the CPU performs the following tasks: I Check requested address against the segment size, I Add the segment base address to the requested memory address, I Check access rights. Gabriel Laskar (EPITA) CAAL 2015 106 / 378 Memory mapping Computer address space translation Segmentation Memory access using segments Memory access 0 Process address space 0 Target address space Segment base Access address offset Effective address Gabriel Laskar (EPITA) CAAL 2015 107 / 378 Memory mapping Computer address space translation Segmentation Segment descriptor table 0 Segment 0 CPU 0 Segment 2 Register 1 0 Segment 3 0 Target base size address 0 12 space − − 6 3 9 3 Gabriel Laskar (EPITA) CAAL 2015 108 / 378 Memory mapping Computer address space translation Segmentation Limitations of segmentation Segmentation is a great thing, but is has a few limitations: I Address-space must be mapped in contiguous blocks I Thus segments are difficult to grow on-demand I A whole segment must be present in the target address space Gabriel Laskar (EPITA) CAAL 2015 109 / 378 Memory mapping Computer address space translation Pagination 4: Memory mapping Address space Computer address spaces Computer address space translation Segmentation Pagination MMU patterns Gabriel Laskar (EPITA) CAAL 2015 110 / 378 Memory mapping Computer address space translation Pagination Memory pages Modern operating systems need more than segmentation. Basic idea is to split the source address space in pages, and map each source page to a target page. Source address 0 space 0 Target address space Gabriel Laskar (EPITA) CAAL 2015 111 / 378 Memory mapping Computer address space translation Pagination Pages mapping Memory pages need to be mapped using a specific descriptor table recognized by the CPU. Usual attributes for a page are I address in target address space, I type and access rights, I page size, I other attributes (cacheability, coherence, ...) Gabriel Laskar (EPITA) CAAL 2015 112 / 378 Memory mapping Computer address space translation Pagination Pages descriptor table Source address 0 space CPU Register 1 0 Target address − space 5 3 6 10 − 4 − − Gabriel Laskar (EPITA) CAAL 2015 113 / 378 Memory mapping Computer address space translation Pagination Pages mapping Rationale Splitting memory in pages allows more powerful memory management: I Address spaces may be mapped to uncontiguous target pages, allowing memory fragmentation. I Many interesting operations can be performed on pages (sharing, swapping, ...) Gabriel Laskar (EPITA) CAAL 2015 114 / 378 Memory mapping MMU patterns 4: Memory mapping Address space Computer address spaces Computer address space translation MMU patterns Memory protection Privileges Process switching Memory sharing Copy On Write Page swapping mmap() Gabriel Laskar (EPITA) CAAL 2015 115 / 378 Memory mapping MMU patterns Memory protection 4: Memory mapping Address space Computer address spaces Computer address space translation MMU patterns Memory protection Privileges Process switching Memory sharing Copy On Write Page swapping mmap() Gabriel Laskar (EPITA) CAAL 2015 116 / 378 Memory mapping MMU patterns Memory protection Memory protection Motivations Why do we need memory protection? Gabriel Laskar (EPITA) CAAL 2015 117 / 378 Memory mapping MMU patterns Memory protection Memory protection In order to execute secure operating system, hardware has to provide memory protection systems. This protects I the system from the hosted processes I hosted processes from each other It may even protect the system from its own components in a Micro-Kernel approach. Gabriel Laskar (EPITA) CAAL 2015 119 / 378 Memory mapping MMU patterns Memory protection Memory protection How? Several memory access checks are performed by the CPU, for each access, transparently: I address bounds validity, I privilege level, I operation type. Gabriel Laskar (EPITA) CAAL 2015 120 / 378 Memory mapping MMU patterns Memory protection Memory protection Where? I In a segmentation-based system, priviliges are per segment. I In a pagination-based system, privileges are in page descriptor table, with a page granularity. x86 mixes both with segmentation on top of pagination. Gabriel Laskar (EPITA) CAAL 2015 121 / 378 Memory mapping MMU patterns Privileges 4: Memory mapping Address space Computer address spaces Computer address space translation MMU patterns Memory protection Privileges Process switching Memory sharing Copy On Write Page swapping mmap() Gabriel Laskar (EPITA) CAAL 2015 122 / 378 Memory mapping MMU patterns Privileges Privilege levels Privilege level keeps memory from being accessed by non-authorized code. Different CPUs define different privilege levels. Instruction Illegal Allowed memory memory access access Code Data Data segment/page segment/page segment/page level 1 level 0 level 1 Target address space Gabriel Laskar (EPITA) CAAL 2015 123 / 378 Memory mapping MMU patterns Privileges Operation type Operation types checking keeps code from doing unwanted memory operations. Instruction Execution Illegal Illegal Illegal Read Write Read ok write read write ok ok ok Code Data Data segment/page segment/page segment/page (exec) (read) (read/write) Target address space Gabriel Laskar (EPITA) CAAL 2015 124 / 378 Memory mapping MMU patterns Process switching 4: Memory mapping Address space Computer address spaces Computer address space translation MMU patterns Memory protection Privileges Process switching Memory sharing Copy On Write Page swapping mmap() Gabriel Laskar (EPITA) CAAL 2015 125 / 378 Memory mapping MMU patterns Process switching Process switching Pagination systems are easier to use with a separate memory address space for each process running on a computer: I Switching process space implies changing the page descriptor table. I Each process has its own page descriptor table ready to be used by the CPU. I Only the CPU register pointing to this table has to be changed to setup a new address space. Gabriel Laskar (EPITA) CAAL 2015 126 / 378 Memory mapping MMU patterns Process switching Process switching Process A Process B address space address space Target address space Process A Process B page table page table CPU register Gabriel Laskar (EPITA) CAAL 2015 127 / 378 Memory mapping MMU patterns Memory sharing 4: Memory mapping Address space Computer address spaces Computer address space translation MMU patterns Memory protection Privileges Process switching Memory sharing Copy On Write Page swapping mmap() Gabriel Laskar (EPITA) CAAL 2015 128 / 378 Memory mapping MMU patterns Memory sharing Memory sharing The memory pagination can be used to share pages between processes. A single page can be mapped in several process address spaces to permit different behaviors: I Save physical memory by not duplicating shared code and read-only data memory (used for shared libraries), I Use shared memory for inter-process communication, Gabriel Laskar (EPITA) CAAL 2015 129 / 378 Memory mapping MMU patterns Memory sharing Memory sharing Process A Process B address space address space 0 1 2 3 0 1 2 0 1 2 3 4 ...... 18 Target address space Process A Process B page table page table Gabriel Laskar (EPITA) CAAL 2015 130 / 378 Memory mapping MMU patterns Copy On Write 4: Memory mapping Address space Computer address spaces Computer address space translation MMU patterns Memory protection Privileges Process switching Memory sharing Copy On Write Page swapping mmap() Gabriel Laskar (EPITA) CAAL 2015 131 / 378 Memory mapping MMU patterns Copy On Write Copy On Write Copy On Write (COW) is a powerful trick used in many situations. One of the most usual ones is fork() where the whole address space of a process has to be cloned. Basic idea is to make the whole memory read-only and actually copy only when necessary, as late as possible. Gabriel Laskar (EPITA) CAAL 2015 132 / 378 Memory mapping MMU patterns Copy On Write Copy On Write Step by step 0 0 − − 3 6 10 − 4 − − Gabriel Laskar (EPITA) CAAL 2015 133 / 378 Memory mapping MMU patterns Copy On Write Copy On Write Step by step 0 0 − − 3 6 10 − 4 − − Gabriel Laskar (EPITA) CAAL 2015 133 / 378 Memory mapping MMU patterns Copy On Write Copy On Write Step by step 0 0 − − − − 3 3 6 6 10 10 − − 4 4 − − − − Gabriel Laskar (EPITA) CAAL 2015 133 / 378 Memory mapping MMU patterns Copy On Write Copy On Write Step by step 0 0 0 − − − − 3 3 6 6 10 10 − − 4 4 − − − − Gabriel Laskar (EPITA) CAAL 2015 133 / 378 Memory mapping MMU patterns Copy On Write Copy On Write Step by step 0 0 0 − − − − 3 3 6 6 10 10 − − 4 4 − − − − Gabriel Laskar (EPITA) CAAL 2015 133 / 378 Memory mapping MMU patterns Copy On Write Copy On Write Step by step 0 0 0 − − − − 3 3 6 6 10 10 − − 4 4 − − − − Gabriel Laskar (EPITA) CAAL 2015 133 / 378 Memory mapping MMU patterns Copy On Write Copy On Write Step by step 0 0 0 − − − − 3 3 6 6 10 10 − − 2 4 − − − − Gabriel Laskar (EPITA) CAAL 2015 133 / 378 Memory mapping MMU patterns Copy On Write Copy On Write Step by step 0 0 0 − − − − 3 3 6 6 10 10 − − 2 4 − − − − Gabriel Laskar (EPITA) CAAL 2015 133 / 378 Memory mapping MMU patterns Page swapping 4: Memory mapping Address space Computer address spaces Computer address space translation MMU patterns Memory protection Privileges Process switching Memory sharing Copy On Write Page swapping mmap() Gabriel Laskar (EPITA) CAAL 2015 134 / 378 Memory mapping MMU patterns Page swapping Page swapping Page swapping is a mechanism artificially enlarging Physical memory with a part of the hard-disk. Gabriel Laskar (EPITA) CAAL 2015 135 / 378 Memory mapping MMU patterns Page swapping Page swapping Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 6 10 − 4 − Disk − Gabriel Laskar (EPITA) CAAL 2015 136 / 378 Memory mapping MMU patterns Page swapping Page swapping Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 6 10 − 4 − Disk − Gabriel Laskar (EPITA) CAAL 2015 136 / 378 Memory mapping MMU patterns Page swapping Page swapping Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 6 10 − 4 − Disk − Gabriel Laskar (EPITA) CAAL 2015 136 / 378 Memory mapping MMU patterns Page swapping Page swapping Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 6 10 − 4 S: 3 − Disk − Gabriel Laskar (EPITA) CAAL 2015 136 / 378 Memory mapping MMU patterns Page swapping Page swapping Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 6 10 − 4 S: 3 − Disk − Gabriel Laskar (EPITA) CAAL 2015 136 / 378 Memory mapping MMU patterns Page swapping Page swapping Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 disk 10 − 4 S: 3 − Disk − Gabriel Laskar (EPITA) CAAL 2015 136 / 378 Memory mapping MMU patterns Page swapping Page swapping Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 disk 10 − 4 S: 3 − Disk − Gabriel Laskar (EPITA) CAAL 2015 136 / 378 Memory mapping MMU patterns Page swapping Page swapping Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 disk 10 − 4 S: 3 − Disk − Gabriel Laskar (EPITA) CAAL 2015 136 / 378 Memory mapping MMU patterns Page swapping Page swapping Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 7 10 − 4 S: 3 − Disk − Gabriel Laskar (EPITA) CAAL 2015 136 / 378 Memory mapping MMU patterns Page swapping Page swapping Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 7 10 − 4 S: 3 − Disk − Gabriel Laskar (EPITA) CAAL 2015 136 / 378 Memory mapping MMU patterns mmap() 4: Memory mapping Address space Computer address spaces Computer address space translation MMU patterns Memory protection Privileges Process switching Memory sharing Copy On Write Page swapping mmap() Gabriel Laskar (EPITA) CAAL 2015 137 / 378 Memory mapping MMU patterns mmap() mmap() Make part of the memory exactly match the contents of a file. I Reflect changes immediatly between processes having file opened I Permit different protections on different parts of the file I Lazily load parts of file I Lazily write parts back Gabriel Laskar (EPITA) CAAL 2015 138 / 378 Memory mapping MMU patterns mmap() mmap() Step by step Source address 0 space CPU Register 1 0 Target address − space 5 3 − − − − − Disk − Gabriel Laskar (EPITA) CAAL 2015 139 / 378 Memory mapping MMU patterns mmap() mmap() Step by step Source address 0 foo.bin space CPU Register 1 0 Target address − space 5 3 f:0 f:1 f:2 f:3 − Disk − Gabriel Laskar (EPITA) CAAL 2015 139 / 378 Memory mapping MMU patterns mmap() mmap() Step by step Source address 0 foo.bin space CPU Register 1 0 Target address − space 5 3 f:0 f:1 f:2 f:3 − Disk − Gabriel Laskar (EPITA) CAAL 2015 139 / 378 Memory mapping MMU patterns mmap() mmap() Step by step Source address 0 foo.bin space CPU Register 1 0 Target address − space 5 3 foo.bin:3 f:0 f:1 f:2 f:3 − Disk − Gabriel Laskar (EPITA) CAAL 2015 139 / 378 Memory mapping MMU patterns mmap() mmap() Step by step Source address 0 foo.bin space CPU Register 1 0 Target address − space 5 3 f:0 f:1 f:2 8 − Disk − Gabriel Laskar (EPITA) CAAL 2015 139 / 378 Memory mapping MMU patterns mmap() mmap() Step by step Source address 0 foo.bin space CPU Register 1 0 Target address − space 5 3 f:0 f:1 f:2 8 − Disk − Gabriel Laskar (EPITA) CAAL 2015 139 / 378 Execution flow Part V Execution flow Gabriel Laskar (EPITA) CAAL 2015 140 / 378 Execution flow Branch, function calls 5: Execution flow Branch, function calls Branch principle Pipeline considerations Function calls Handling events Multitasking Gabriel Laskar (EPITA) CAAL 2015 141 / 378 Execution flow Branch, function calls Branch principle 5: Execution flow Branch, function calls Branch principle Pipeline considerations Function calls Handling events Multitasking Gabriel Laskar (EPITA) CAAL 2015 142 / 378 Execution flow Branch, function calls Branch principle Branch principle Branching is breaking the normal incremental execution flow to go execute code somewhere else. A branch has I a destination, I optionally a condition, I optionally a “link” feature, saving return point. Gabriel Laskar (EPITA) CAAL 2015 143 / 378 Execution flow Branch, function calls Branch principle Branch offset Instructions are fetched from memory to CPU by dereferencing the program counter (%pc register) I in normal execution flow, the %pc is auto-incremented Addresses Instructions 80450010 add %g1, %g2, %g3 80450014 mov %g0, %g1 80450018 xor %g2, %g3, %g2 Executed instruction %pc 8045001C sub %g3, %g2, %g1 Next instruction 80450020 ... I when a branch occurs, the %pc is affected in two possible ways: relative branch: a constant offset is added to the %pc absolute branch: an absolute address is loaded into the %pc Gabriel Laskar (EPITA) CAAL 2015 144 / 378 Execution flow Branch, function calls Branch principle Unconditional branch The %pc register is always modified. I Explicit jump: goto I Explicit infinite loop, optimized by the compiler Addresses Instructions 80450010 b 8045011C Executed instruction 80450014 mov %g0, %g1 80450118 xor %g2, %g3, %g2 %pc 8045011C sub %g3, %g2, %g1 Next instruction 80450120 ... Gabriel Laskar (EPITA) CAAL 2015 145 / 378 Execution flow Branch, function calls Branch principle Unconditional branch C listing #include void main(void) { do { puts("hello"); } while (1); } Gabriel Laskar (EPITA) CAAL 2015 146 / 378 Execution flow Branch, function calls Branch principle Unconditional branch Control flow graph main puts("hello"); return Gabriel Laskar (EPITA) CAAL 2015 147 / 378 Execution flow Branch, function calls Branch principle Conditional branch The %pc register is modified only if the condition is verified I if statements I loop (for, while) statements Addresses Instructions 80450010 bz 8045011C Executed instruction 80450014 mov %g0, %g1 Potential next instruction 80450018 ... %pc 80450118 xor %g2, %g3, %g2 8045011C sub %g3, %g2, %g1 Potential next instruction 80450120 ... Gabriel Laskar (EPITA) CAAL 2015 148 / 378 Execution flow Branch, function calls Branch principle Conditional branch C listing #include void main(void) { int i= 10; do { printf("%i\n",--i); } while (i !=0); } Gabriel Laskar (EPITA) CAAL 2015 149 / 378 Execution flow Branch, function calls Branch principle Conditional branch Control flow graph main i = 10 −−i printf("%i\n"); yes i != 0 no return Gabriel Laskar (EPITA) CAAL 2015 150 / 378 Execution flow Branch, function calls Branch principle Conditions The decision to take the branch is based on register contents: I conditional branch may occur if a specific bit is set in a register I conditional branch may occur if a specific bit is clear in a register I conditional branch may occur if a register equals a specific value (usually zero) Gabriel Laskar (EPITA) CAAL 2015 151 / 378 Execution flow Branch, function calls Branch principle Complete examples 1. strlen 2. pgcd Gabriel Laskar (EPITA) CAAL 2015 152 / 378 Execution flow Branch, function calls Pipeline considerations 5: Execution flow Branch, function calls Branch principle Pipeline considerations Function calls Handling events Multitasking Gabriel Laskar (EPITA) CAAL 2015 153 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations A branch may break the pipeline, slowing the execution 1. when things goes wrong, a bubble is created 2. sometimes, the processor succeeds in predicting the branch target 3. when a misprediction occurs, a bubble is created Gabriel Laskar (EPITA) CAAL 2015 154 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Bubble When a branch occurs, the processor may flush the stages of the pipeline that contains useless instructions: ? Fetch Instruction sequence Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 155 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Bubble When a branch occurs, the processor may flush the stages of the pipeline that contains useless instructions: sub Fetch Decod ? Fetch Instruction sequence Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 155 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Bubble When a branch occurs, the processor may flush the stages of the pipeline that contains useless instructions: sub Fetch Decod Exec not Fetch Decod ? Fetch Instruction sequence Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 155 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Bubble When a branch occurs, the processor may flush the stages of the pipeline that contains useless instructions: sub Fetch Decod Exec Mem not Fetch Decod Exec add Fetch Decod ? Fetch Instruction sequence Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 155 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Bubble When a branch occurs, the processor may flush the stages of the pipeline that contains useless instructions: sub Fetch Decod Exec Mem Write not Fetch Decod Exec Mem add Fetch Decod Exec jmp Fetch Decod ! Instruction sequence ? Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 155 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Bubble When a branch occurs, the processor may flush the stages of the pipeline that contains useless instructions: sub Fetch Decod Exec Mem Write not Fetch Decod Exec Mem Write add Fetch Decod Exec Mem jmp Fetch Decod Exec Instruction sequence −− Fetch −− ? Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 155 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Bubble When a branch occurs, the processor may flush the stages of the pipeline that contains useless instructions: sub Fetch Decod Exec Mem Write not Fetch Decod Exec Mem Write add Fetch Decod Exec Mem Write jmp Fetch Decod Exec Mem Instruction sequence −− Fetch −− −− add Fetch Decod ? Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 155 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Delay slot A delay slot is a processor feature. It’s a convention saying the instruction following branches is always executed. Branch is delayed. sub Fetch Decod Exec Mem Write not Fetch Decod Exec Mem add Fetch Decod Exec jmp Fetch Decod Instruction sequence ? Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 156 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Delay slot A delay slot is a processor feature. It’s a convention saying the instruction following branches is always executed. Branch is delayed. sub Fetch Decod Exec Mem Write not Fetch Decod Exec Mem add Fetch Decod Exec jmp Fetch Decod Instruction sequence instruction following ? jmp is executed Fetch execution flow is modified one cycle later Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 156 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Delay slot A delay slot is a processor feature. It’s a convention saying the instruction following branches is always executed. Branch is delayed. sub Fetch Decod Exec Mem Write not Fetch Decod Exec Mem Write add Fetch Decod Exec Mem jmp Fetch Decod Exec Instruction sequence instruction following sub jmp is executed Fetch Decod execution flow ? is modified one cycle later Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 156 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Delay slot A delay slot is a processor feature. It’s a convention saying the instruction following branches is always executed. Branch is delayed. sub Fetch Decod Exec Mem Write not Fetch Decod Exec Mem Write add Fetch Decod Exec Mem Write jmp Fetch Decod Exec Mem Instruction sequence instruction following sub jmp is executed Fetch Decod Exec execution flow add is modified one cycle later Fetch Decod ? Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 156 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression) ? Fetch Instruction sequence Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression) sub g1, g2, g3 Fetch Decod ? Fetch Instruction sequence Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression) sub g1, g2, g3 Fetch Decod Exec ld g5, [g2] Fetch Decod ? Fetch Instruction sequence Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression) sub g1, g2, g3 Fetch Decod Exec Mem ld g5, [g2] Fetch Decod Exec ! add g6, 0x2a, g5 Fetch Decod ? Fetch Instruction sequence Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression) sub g1, g2, g3 Fetch Decod Exec Mem Write ld g5, [g2] Fetch Decod Exec Mem ! add g6, 0x2a, g5 Fetch Decod Exec Fetch Decod Instruction sequence Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression) sub g1, g2, g3 Fetch Decod Exec Mem Write ld g5, [g2] Fetch Decod Exec Mem ! add g6, 0x2a, g5 Fetch Decod Exec Fetch Decod Instruction sequence Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression) sub g1, g2, g3 Fetch Decod Exec Mem Write ld g5, [g2] Fetch Decod Exec Mem ! add g6, 0x2a, g5 Fetch Decod Exec add g6, 0x2a, g5 Fetch Decod Instruction sequence Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression) sub g1, g2, g3 Fetch Decod Exec Mem Write ld g5, [g2] Fetch Decod Exec Mem Write add g6, 0x2a, g5 Fetch Decod (Exec) −− add g6, 0x2a, g5 Fetch (Decod) Exec Instruction sequence ld i3, [g6 + 0x8] (Fetch) Decod ? Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression) sub g1, g2, g3 Fetch Decod Exec Mem Write ld g5, [g2] Fetch Decod Exec Mem Write add g6, 0x2a, g5 Fetch Decod (Exec) −− add g6, 0x2a, g5 Fetch (Decod) Exec Instruction sequence ld i3, [g6 + 0x8] (Fetch) Decod ? Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Branch, function calls Pipeline considerations Pipeline considerations Loads (digression) sub g1, g2, g3 Fetch Decod Exec Mem Write ld g5, [g2] Fetch Decod Exec Mem Write add g6, 0x2a, g5 Fetch Decod (Exec) −− −− Fetch (Decod) Exec Mem Instruction sequence ld i3, [g6 + 0x8] (Fetch) Decod Exec xor i4, 0x2a, i3 Fetch Decod ? Fetch Pipeline timeline Gabriel Laskar (EPITA) CAAL 2015 157 / 378 Execution flow Function calls 5: Execution flow Branch, function calls Function calls Principles Argument passing Call conventions Handling events Multitasking Gabriel Laskar (EPITA) CAAL 2015 158 / 378 Execution flow Function calls Principles 5: Execution flow Branch, function calls Function calls Principles Argument passing Call conventions Handling events Multitasking Gabriel Laskar (EPITA) CAAL 2015 159 / 378 Execution flow Function calls Principles Function calls principle A function is a piece of code that returns a value depending on the parameters the user specifies as inputs, if any. Addresses Instructions Instructions Addresses 8004500 8002000 8004504 8002004 8004508 8002008 800450B Gabriel Laskar (EPITA) CAAL 2015 160 / 378 Execution flow Function calls Principles “call” instruction I Saves next instruction address (the return path), I Jumps to function Addresses Instructions Return address 80450018 %i7 80450010 call 80002000 Executed instruction 80450014 mov %g0, %g1 Delay slot instruction 80450018 xor %g2, %g1, %g3 %pc 80002000 add %g1, %g2, %g3 Next instruction 80002004 ... Gabriel Laskar (EPITA) CAAL 2015 161 / 378 Execution flow Function calls Principles “ret” instruction I Stores the result in a dedicated register I Restores the previously-saved PC address Addresses Instructions Return address 800020F8 ret Executed instruction %i7 800020FC not %g1, %g2 Delay slot instruction 80002100 ... 80450018 80450010 call 80002000 80450014 mov %g0, %g1 %pc 80450018 xor %g2, %g1, %g3 Next instruction Gabriel Laskar (EPITA) CAAL 2015 162 / 378 Execution flow Function calls Argument passing 5: Execution flow Branch, function calls Function calls Principles Argument passing Call conventions Handling events Multitasking Gabriel Laskar (EPITA) CAAL 2015 163 / 378 Execution flow Function calls Argument passing How to pass arguments and return values? We need a space of shared data between the caller and the callee. I From caller to callee, for arguments I From callee to caller, for return values Some arguments are purely for execution purposes: I context pointers (stack, globals, ...) I return address Gabriel Laskar (EPITA) CAAL 2015 164 / 378 Execution flow Function calls Argument passing Simplest case I Non-nested function call I Less arguments than available machine registers Gabriel Laskar (EPITA) CAAL 2015 165 / 378 Execution flow Function calls Argument passing Through registers On most RISC architectures, there are registers dedicated to argument storage: I Each argument is stored and preserved directly in a register I Local variables may be held in another set of registers Example: on SPARC architecture, the CPU has: I 8 registers (%g0, ... %g7) dedicated to global variables I 8 registers (%i0, ... %i7) dedicated to input arguments I 8 registers (%l0, ... %l7) dedicated to local variables I 8 registers (%o0, ... %o7) dedicated to output arguments A callee’s “input” registers are shared with caller’s “output” registers. Gabriel Laskar (EPITA) CAAL 2015 166 / 378 Execution flow Function calls Argument passing Through global memory Principle: I Each argument is stored and preserved directly in memory I Each local variable is held in memory Gabriel Laskar (EPITA) CAAL 2015 167 / 378 Execution flow Function calls Call conventions 5: Execution flow Branch, function calls Function calls Principles Argument passing Call conventions Handling events Multitasking Gabriel Laskar (EPITA) CAAL 2015 168 / 378 Execution flow Function calls Call conventions Call conventions I How to deal with huge amount of variables and arguments? I How to deal with recursive calls and nested function calls? I How to deal with variables bigger than registers? I How to deal with “...” (like printf)? I How to deal with dedicated registers (floats)? Gabriel Laskar (EPITA) CAAL 2015 169 / 378 Execution flow Function calls Call conventions Problem Consider a recursive function that needs local variables: I Each time the function is called, a new context must be allocated to preserve each local variable I Each time the function returns, the previous context must be restored I The function may call itself an unpredictable number of times I These local variables cannot be reserved in a static memory space (as global variables are). Gabriel Laskar (EPITA) CAAL 2015 170 / 378 Execution flow Function calls Call conventions Abstract context stack Function calls void foo(int a) { return foo(a + 1); } Local context Local context Local context Function returns Gabriel Laskar (EPITA) CAAL 2015 171 / 378 Execution flow Function calls Call conventions Register window Hardware context stack implementation: I Uses a large amount of registers (example: 512 on Sparc) I Uses a logical limitation to define multiple contexts I Does context change by sliding a window on each function call Gabriel Laskar (EPITA) CAAL 2015 172 / 378 Execution flow Function calls Call conventions Principle Abstract stack layout Hardware registers Sliding register window Function calls %R120 %R119 Next Local %R118 context %R117 %R116 %o0 − %o7 %R115 Current Local %l0 − %l7 context %R102 %R101 %i0 − %i7 %R100 Previous Local %R99 context %R98 %R97 %R96 %R95 Function returns Gabriel Laskar (EPITA) CAAL 2015 173 / 378 Execution flow Function calls Call conventions Sparc overlapping register window Function call Input registers %i0 ... %i7 register Local registers window %l0 ... %l7 Called fonctions context Output registers Register Input registers %o0 ... %o7 overlap %i0 ... %i7 Calling fonctions context Local registers register %l0 ... %l7 window Output registers %o0 ... %o7 Gabriel Laskar (EPITA) CAAL 2015 174 / 378 Execution flow Function calls Call conventions Function prologue and epilogue prologue: slide register window Example: save %sp,-96, %sp epilogue: slide back register window (i.e. restore previous context). Example: restore Gabriel Laskar (EPITA) CAAL 2015 175 / 378 Execution flow Function calls Call conventions Memory stack Register window has hard limitations: I The call depth is limited to the total amount of registers I The CPU needs a large amount of physical registers ⇒ expensive On most systems, memory is used to implement a cheaper stack. Gabriel Laskar (EPITA) CAAL 2015 176 / 378 Execution flow Function calls Call conventions Principle Abstract stack layout Sliding stack frame Process memory Function calls Called function Stack Next arguments memory Local space context Stack pointer Local Current variables Current stack Local frame context Frame Return address pointer Input Previous arguments Local context Function returns Gabriel Laskar (EPITA) CAAL 2015 177 / 378 Execution flow Function calls Call conventions Nested function calls Stack memory space Function calls Local variables Return address Input Called function arguments arguments Stack pointer Local variables Current function Return address Frame context pointer Called function Input arguments arguments Local variables Return address Gabriel Laskar (EPITA) CAAL 2015 178 / 378 Execution flow Function calls Call conventions Function prologue and epilogue prologue: saves previous frame pointer, set new frame pointer, reserve space on memory stack for local variables Example: a function that needs three 32 bits local variables (12 bytes) on its stack: [%sp] <- %fp %fp <- %sp %sp <- %sp- 12 epilogue: restores previous frame and stack pointers (i.e. restore previous context). Example: %sp <- %fp %fp <-[%fp] Gabriel Laskar (EPITA) CAAL 2015 179 / 378 Execution flow Function calls Call conventions Argument and local variable access argument: Dereference the address “frame pointer + argument offset”: ld[%fp+ 16], %g1 ; Access to arg_a local variable: Dereference the address “frame pointer - local variable offset” ld[%fp-4], %g1 ; Access to var_b Gabriel Laskar (EPITA) CAAL 2015 180 / 378 Execution flow Function calls Call conventions Argument and local variable access schema Stack pointer int var_c local int var_b Frame variables pointer int var_a previous fp void foo(int arg_a, int arg_b) ret address { int arg_b function int var_a, var_, var_c; int arg_a arguments ... } previous fp ret address Gabriel Laskar (EPITA) CAAL 2015 181 / 378 Execution flow Handling events 5: Execution flow Branch, function calls Function calls Handling events Events System calls Faults Hardware interruptions Multitasking Gabriel Laskar (EPITA) CAAL 2015 182 / 378 Execution flow Handling events Events 5: Execution flow Branch, function calls Function calls Handling events Events System calls Faults Hardware interruptions Multitasking Gabriel Laskar (EPITA) CAAL 2015 183 / 378 Execution flow Handling events Events Events What are events? How to handle them? Gabriel Laskar (EPITA) CAAL 2015 184 / 378 Execution flow Handling events Events Categorizing events An interrupt is caused by an external event. An exception is caused by instruction execution. Unplanned Deliberate Synchronous fault syscall trap, software interrupt Asynchronous hardware interruption → An incoming event must be executed with a high priority. Gabriel Laskar (EPITA) CAAL 2015 185 / 378 Execution flow Handling events Events User space / kernel space A trap is a critical event that must be handled by the kernel in a safe memory space, the kernel space. User mode Kernel mode Kernel code Kernel Invalid memory Kernel space stack User code User code User data User data Process memory space Usr.stack Usr.stack Gabriel Laskar (EPITA) CAAL 2015 186 / 378 Execution flow Handling events Events Trap handler table The processor jumps to a trap routine defined by the operating system. The trap routine address is stored in a dedicated descriptor table. Kernel mode Kernel code Interruption table Kernel Interruption data table address User code register User data Usr.stack trap #4 Gabriel Laskar (EPITA) CAAL 2015 187 / 378 Execution flow Handling events System calls 5: Execution flow Branch, function calls Function calls Handling events Events System calls Faults Hardware interruptions Multitasking Gabriel Laskar (EPITA) CAAL 2015 188 / 378 Execution flow Handling events System calls System calls The system call mechanism can be used by a process to request services from the operating system: I executing a process, exiting I reading input, writing output (read, write...) I performing restricted actions such as accessing hardware devices or accessing the memory management unit. I etc, ... Gabriel Laskar (EPITA) CAAL 2015 189 / 378 Execution flow Handling events System calls Processor modes Hardware facilities have been implemented to ensure process isolation in multi-user/multi-processor environment. Today, processors have two (or more) execution modes: user mode dedicated to user applications supervisor mode reserved for operating system kernel Gabriel Laskar (EPITA) CAAL 2015 190 / 378 Execution flow Handling events System calls Execution permissions For security sake, some operations are restricted to kernel space: I peripheral input and output I low-level memory management System calls allow user to switch to kernel space and perform restricted actions, under certain conditions. The kernel must check system call parameters validity. Gabriel Laskar (EPITA) CAAL 2015 191 / 378 Execution flow Handling events System calls System call implementation 1. System calls use a dedicated instruction which causes the processor to change mode to superuser (or protected) mode. 2. Each system call is indexed by a single number: the syscall trap handler makes an indirect call through the system call dispatch table to the handler for the specific system call. Gabriel Laskar (EPITA) CAAL 2015 192 / 378 Execution flow Handling events System calls Execution path System calls run in kernel mode on the kernel memory space. User mode Kernel mode User system code call context switch Kernel code system return context switch User code Gabriel Laskar (EPITA) CAAL 2015 193 / 378 Execution flow Handling events System calls libc example Standart C library’s read, write, pipe, ... functions are wrappers to corresponding system calls: _SYSENTRY(pipe) mov %o0, %o2 mov SYS_pipe, %g1 ta %xcc, ST_SYSCALL bcc,a,pt %xcc,1f stw %o0,[%o2] ERROR() 1: stw %o1,[%o2+4] retl clr %o0 _SYSEND(pipe) Gabriel Laskar (EPITA) CAAL 2015 194 / 378 Execution flow Handling events Faults 5: Execution flow Branch, function calls Function calls Handling events Events System calls Faults Hardware interruptions Multitasking Gabriel Laskar (EPITA) CAAL 2015 195 / 378 Execution flow Handling events Faults Faults How to handle divide by zero? How to handle overflow? How to handle ... Gabriel Laskar (EPITA) CAAL 2015 196 / 378 Execution flow Handling events Faults Similarities with system calls Faults are similar to system calls in some respects: I Faults occur as a result of a process executing an instruction I The kernel exception handler may return to the faulty user context But faults are different in other respects: I Syscalls are deliberate, faults are unexpected I Not every execution of the instruction results in a fault Gabriel Laskar (EPITA) CAAL 2015 197 / 378 Execution flow Handling events Faults Handling a fault Different actions may be taken by the operating system in response to faults: I kill the user process I notify the process that a fault occurred (so it may recover in its own way) I solve the problem and resume the process transparently Gabriel Laskar (EPITA) CAAL 2015 198 / 378 Execution flow Handling events Faults Execution path User mode Kernel mode Instruction User fault code context save Kernel code context restore User process code exit Gabriel Laskar (EPITA) CAAL 2015 199 / 378 Execution flow Handling events Faults Events interface Unix systems can notify a user program of a fault with a signal. Signals are also used for other forms of asynchronous event notifications. Gabriel Laskar (EPITA) CAAL 2015 200 / 378 Execution flow Handling events Hardware interruptions 5: Execution flow Branch, function calls Function calls Handling events Events System calls Faults Hardware interruptions Multitasking Gabriel Laskar (EPITA) CAAL 2015 201 / 378 Execution flow Handling events Hardware interruptions Hardware interruptions How to handle devices interruptions? Gabriel Laskar (EPITA) CAAL 2015 202 / 378 Execution flow Handling events Hardware interruptions Execution path User mode Kernel mode User hardware code interruption context save Kernel code interruption return context load User code Gabriel Laskar (EPITA) CAAL 2015 203 / 378 Execution flow Multitasking 5: Execution flow Branch, function calls Function calls Handling events Multitasking Gabriel Laskar (EPITA) CAAL 2015 204 / 378 Execution flow Multitasking Multitasking A single process may not use system resources at full capacity. The idea of multitasking is to simulate the execution of concurrent execution of many processes, using a single processor. The operating system has to: I create and delete processes, I organize processes in memory, I schedule processes for CPU use. Gabriel Laskar (EPITA) CAAL 2015 205 / 378 Execution flow Multitasking Memory mapping A process addressing B process addressing space space Pysical memory space A process B process memory map memory map Page table address register Gabriel Laskar (EPITA) CAAL 2015 206 / 378 Execution flow Multitasking Context switching Memory contexts Real backup CPU registers A Reg 1 process Reg 2 registers B process FlagReg registers PageReg Gabriel Laskar (EPITA) CAAL 2015 207 / 378 Execution flow Multitasking Process life Timer interruption context load context save Kernel Process A Process B Process C Gabriel Laskar (EPITA) CAAL 2015 208 / 378 This approach requires a way of handling events. Object file formats Part VI Object file formats Gabriel Laskar (EPITA) CAAL 2015 209 / 378 Object file formats Build process 6: Object file formats Build process Overview Development tools Analysis tools Binary formats Executable code loading Gabriel Laskar (EPITA) CAAL 2015 210 / 378 Object file formats Build process Overview 6: Object file formats Build process Overview Development tools Analysis tools Binary formats Executable code loading Gabriel Laskar (EPITA) CAAL 2015 211 / 378 Object file formats Build process Overview The big picture C source pure C Assembly Object code code code file cpp cc1 as .c .i .s .o .h Header files ld a.out .o .o Gabriel Laskar (EPITA) CAAL 2015 212 / 378 Object file formats Build process Overview Preprocessor C source Pure C Assembly Object Executable code code code file cpp cc1 as ld .c .i .s .o Header files I Also called macro-processor I Transforms a text (source) file into another text (source) file: I merges many files into one (#include) I replaces macros by their definitions (#define) I removes code considering conditions (#if*) Example: cpp I Input: C source file with directives I Output: “pure” C source file Gabriel Laskar (EPITA) CAAL 2015 213 / 378 Object file formats Build process Overview C compiler C source Pure C Assembly Object Executable code code code file cpp cc1 as ld .c .i .s .o Header files I Scans and parses the source file I Analyzes the source (type checking) I Generates assembly code (another source file) for the target architecture Gabriel Laskar (EPITA) CAAL 2015 214 / 378 Object file formats Build process Overview Assembler C source Pure C Assembly Object Executable code code code file cpp cc1 as ld .c .i .s .o Header files I Translates instructions in binary opcode sequence for the target architecture I Collects symbols and resolve label addresses I Writes an object file The assembler source file will be different depending on architecture! Gabriel Laskar (EPITA) CAAL 2015 215 / 378 Object file formats Build process Overview Linker C source Pure C Assembly Object Executable code code code file cpp cc1 as ld .c .i .s .o Header files I Merges (many) object files I Resolves external symbols I Computes addresses I Writes an executable file Gabriel Laskar (EPITA) CAAL 2015 216 / 378 Object file formats Build process Development tools 6: Object file formats Build process Overview Development tools Analysis tools Binary formats Executable code loading Gabriel Laskar (EPITA) CAAL 2015 217 / 378 Object file formats Build process Development tools Development tools I SPARC assembler: use aasm I Project at http://savannah.nongnu.org/projects/aasm I Sparc, Mips, PPC, ARM, ... architecture emulators I SoCLib: https://www.soclib.fr I QEmu Gabriel Laskar (EPITA) CAAL 2015 218 / 378 Object file formats Build process Analysis tools 6: Object file formats Build process Overview Development tools Analysis tools Binary formats Executable code loading Gabriel Laskar (EPITA) CAAL 2015 219 / 378 Object file formats Build process Analysis tools objdump I Displays object (executable) file tables, on host architecture I Disassembles object (executable) file code-sections I Useful options: I -h: display the section headers I -t: display the symbol tables I -dx: disassemble Gabriel Laskar (EPITA) CAAL 2015 220 / 378 Object file formats Build process Analysis tools readelf I Displays ELF object (executable) file tables, on any architecture Gabriel Laskar (EPITA) CAAL 2015 221 / 378 Object file formats Binary formats 6: Object file formats Build process Binary formats Simple binary formats How about code reuse and splitting Linking File formats history Executable code loading Gabriel Laskar (EPITA) CAAL 2015 222 / 378 Object file formats Binary formats Simple binary formats 6: Object file formats Build process Binary formats Simple binary formats How about code reuse and splitting Linking File formats history Executable code loading Gabriel Laskar (EPITA) CAAL 2015 223 / 378 Object file formats Binary formats Simple binary formats Simple binary formats Consider an object program that does not need other object programs to build a binary program: I the assembler collects code markers (labels) I the assembler resolves all labels and replaces all of them in the code Gabriel Laskar (EPITA) CAAL 2015 224 / 378 Object file formats Binary formats Simple binary formats Flat program The labels are immediately replaced and written by the assembler in the binary file: Address Opcodes Source code 00000000 90 nop 00000001 B8 78 56 34 12 mov eax, 0x12345678 00000006 E8 00 12 00 00 call my_function ... my_function: 00001200 90 nop 00001201 ... Gabriel Laskar (EPITA) CAAL 2015 225 / 378 Object file formats Binary formats Simple binary formats Flat binary format In raw binary format, the whole program is written directly in file. Useful at boot time: the processor can only read raw code and there is no binary loader. Gabriel Laskar (EPITA) CAAL 2015 226 / 378 Object file formats Binary formats How about code reuse and splitting 6: Object file formats Build process Binary formats Simple binary formats How about code reuse and splitting Linking File formats history Executable code loading Gabriel Laskar (EPITA) CAAL 2015 227 / 378 Object file formats Binary formats How about code reuse and splitting Challenges What happens when a function must be imported? What happens when a function must be exported? Gabriel Laskar (EPITA) CAAL 2015 228 / 378 Object file formats Binary formats How about code reuse and splitting Complex program When the symbol of a called function is unresolved, the assembler leaves the call destination address undefined: Address Opcodes Source code 00000000 90 nop 00000001 B8 78 56 34 12 mov eax, 0x12345678 00000006 E8 00 12 00 00 call my_function 00000006 E8 00 12 00 00 call printf ... External my_function: reference 00001200 90 nop 00001201 ... ⇒ The binary format must hold information on created “holes” Gabriel Laskar (EPITA) CAAL 2015 229 / 378 Object file formats Binary formats How about code reuse and splitting Needs To fill the holes left by the assembler later on, the binary file must hold: I A symbol table, which stores each label identifier, I A relocation table, which shows all the remaining “holes” I Labels associated to each “hole”. More generaly, a binary file format must also hold: I A header containing general informations needed to access various parts of the file, I Several sections holding code and data (raw data). Gabriel Laskar (EPITA) CAAL 2015 230 / 378 Object file formats Binary formats How about code reuse and splitting Abstraction of a binary file format file header .text .rodata .data .bss (not in file) section table reloc. table symbol table string table object file headers object file sections (Information regarding sections, symbol table etc. are usually identified within the file header) Gabriel Laskar (EPITA) CAAL 2015 231 / 378 Object file formats Binary formats How about code reuse and splitting The symbol table The symbol table associates each symbol identifier to the code (or data) address it represents (when the address has been resolved): symbol table 0000 E8 00 01 00 00 call my_func "printf" 0005 E8 ?? ?? ?? ?? call printf "my_func" ... my_func: "fopen" 0100 E8 ?? ?? ?? ?? call fopen 0105 90 nop 0106 E8 ?? ?? ?? ?? call printf Gabriel Laskar (EPITA) CAAL 2015 232 / 378 Object file formats Binary formats How about code reuse and splitting The relocation table The relocation table associates each “hole” address to a label identifier address: symbol table relocation table printf 0000 E8 00 01 00 00 call my_func "printf" 0006 0005 E8 ?? ?? ?? ?? call printf "my_func" printf ... 0107 my_func: "fopen" fopen 0101 0100 E8 ?? ?? ?? ?? call fopen 0105 90 nop 0106 E8 ?? ?? ?? ?? call printf Gabriel Laskar (EPITA) CAAL 2015 233 / 378 Object file formats Binary formats How about code reuse and splitting BSS regions Instead of keeping a bunch of empty bytes in the binary file for big static variables (arrays), the header can specify a whole region which must be filled with zero. BSS regions makes the file smaller and faster to load. Gabriel Laskar (EPITA) CAAL 2015 234 / 378 Object file formats Binary formats Linking 6: Object file formats Build process Binary formats Simple binary formats How about code reuse and splitting Linking File formats history Executable code loading Gabriel Laskar (EPITA) CAAL 2015 235 / 378 Object file formats Binary formats Linking Linking The operation that combines a list of object programs into a binary program is called linking. The tasks that must be accomplished by the linker are: 1. Named sections from different object programs must be merged together into one named section. 2. Merged sections must be put together into the sections of the memory model. 3. Each use of a name in an object program’s references list must be replaced by an address in the virtual address space. Gabriel Laskar (EPITA) CAAL 2015 236 / 378 Object file formats Binary formats Linking Merging object files Combining each .text section together, each .data section together, etc.: .text .text .text .rodata .rodata .rodata .data .data .data Gabriel Laskar (EPITA) CAAL 2015 237 / 378 Object file formats Binary formats Linking Zoom on .text section merging A object B object merged object 0x0section 0x0section 0x0 section A .text B .text .text C Symbol = S Symbol = B + S Gabriel Laskar (EPITA) CAAL 2015 238 / 378 Object file formats Binary formats Linking Symbol resolution raw binary format .text / .data a.out format file header .text .data relocations symbols Gabriel Laskar (EPITA) CAAL 2015 239 / 378 Object file formats Binary formats Linking Executable binary files When there is neither unresolved symbols nor relocations anymore, the file is executable. Executable files usually have a fixed constant load address on a given system. Unresolved/unused symbols may exist in an executable file if no relocation uses it. The strip command wipes out unused symbols from the symbol table. Gabriel Laskar (EPITA) CAAL 2015 240 / 378 Object file formats Binary formats File formats history 6: Object file formats Build process Binary formats Simple binary formats How about code reuse and splitting Linking File formats history Executable code loading Gabriel Laskar (EPITA) CAAL 2015 241 / 378 Object file formats Binary formats File formats history a.out: Assembler Output raw binary format .text / .data a.out format file header .text .data relocations symbols I Very simple I .. but pretty fast to load Gabriel Laskar (EPITA) CAAL 2015 242 / 378 Object file formats Binary formats File formats history COFF (Common Object File Format) and ELF (Executable and Linking Format) COFF format file section syms .text relocs .data header table ELF format file section relocs symbols .text .data header table I Permits use of dynamic libraries Gabriel Laskar (EPITA) CAAL 2015 243 / 378 Object file formats Binary formats File formats history Mach-O (Darwin) file load text rodata data header commands Gabriel Laskar (EPITA) CAAL 2015 244 / 378 Object file formats Binary formats File formats history Mach-O (Darwin) “Fat binaries” Mach−O fat binaries Fat header Fat arch table file load text rodata data header commands file load text rodata data header commands Gabriel Laskar (EPITA) CAAL 2015 245 / 378 Object file formats Executable code loading 6: Object file formats Build process Binary formats Executable code loading Binary file loading Loading process Dynamic libraries Gabriel Laskar (EPITA) CAAL 2015 246 / 378 Object file formats Executable code loading Binary file loading 6: Object file formats Build process Binary formats Executable code loading Binary file loading Loading process Dynamic libraries Gabriel Laskar (EPITA) CAAL 2015 247 / 378 Object file formats Executable code loading Binary file loading Binary file loading Executable file format is like object file format, with the following restrictions: I It has no relocations, I All addresses are resolved I It is ready to load in memory Gabriel Laskar (EPITA) CAAL 2015 248 / 378 Object file formats Executable code loading Binary file loading Executable format file header .text .rodata .data .bss (not in file) section table reloc. table symbol table string table objectGabriel Laskar file (EPITA) headers objectCAAL file sections 2015 249 / 378 Object file formats Executable code loading Binary file loading Executable loading Process memory structure is mapped to internal executable file format: I Loadable sections are loaded from file to memory I Uninitialised data (.bss section) is allocated in process memory I Internal file sections are ignored I Heap and stack are allocated Gabriel Laskar (EPITA) CAAL 2015 250 / 378 Object file formats Executable code loading Loading process 6: Object file formats Build process Binary formats Executable code loading Binary file loading Loading process Dynamic libraries Gabriel Laskar (EPITA) CAAL 2015 251 / 378 Object file formats Executable code loading Loading process Executable memory mapping process memory file header .text .text initialized .rodata .rodata memory .data .data .bss (not in file) .bss section table non reloc. table heap initialized memory symbol table stack string table object file headers object file sections memory sections Gabriel Laskar (EPITA) CAAL 2015 252 / 378 Object file formats Executable code loading Loading process Memory attributes Memory attributes depends on section and area type: I .text section may be loaded in executable only region (depending on OS) I memory used to store .rodata section will be marked as read-only I memory used to store .data, .bss sections, heap and stack will be marked as read/write Gabriel Laskar (EPITA) CAAL 2015 253 / 378 Object file formats Executable code loading Loading process Memory attributes executable file process memory .text .text −−x executable data pages .rodata loaded .rodata initialized r−− read only from file memory pages .data .data not in file .bss rw− read/write sym. table heap non pages internal initialized string table file data memory stack object file sections memory sections virtual memory rights Gabriel Laskar (EPITA) CAAL 2015 254 / 378 Object file formats Executable code loading Dynamic libraries 6: Object file formats Build process Binary formats Executable code loading Binary file loading Loading process Dynamic libraries Gabriel Laskar (EPITA) CAAL 2015 255 / 378 Object file formats Executable code loading Dynamic libraries Dynamic libraries Use of dynamic libraries implies: I different and more complex memory mapping I position independent code I different way to handle relocations I plenty of cool complicated stuff Gabriel Laskar (EPITA) CAAL 2015 256 / 378 Object file formats Executable code loading Dynamic libraries Dynamic libs in memory Libraries have specific memory mapping and organisation: I A dynamic library is loaded only once even if several processes use it I Extra .data and .text sections are mapped in process memory-space I Library .text section is mapped in several process memory I Library .data section is copied in each process memory Gabriel Laskar (EPITA) CAAL 2015 257 / 378 Object file formats Executable code loading Dynamic libraries Dynamic process memory layout executable A process A process B executable B .text .text .text .text .rodata .rodata .data .data .data .data .data .data copy library B library A copy .text .data .text .text .text .text heap heap stack stack Gabriel Laskar (EPITA) CAAL 2015 258 / 378 Object file formats Executable code loading Dynamic libraries Handling code location change Dynamic libraries may not be mapped with the same virtual address in all processes: I Address may already be in use by an other library I The same code must be able to run with different base addresses Position independent code solves these problems without going through the complex relocation process. Relative jumps and relative data memory accesses need to be handled by the CPU. Gabriel Laskar (EPITA) CAAL 2015 259 / 378 Object file formats Executable code loading Dynamic libraries Position independent code executable process .text .text .rodata 0100 jmp +12 .rodata .data .data position relative independent 0112 mov ... branch code .text library heap stack Gabriel Laskar (EPITA) CAAL 2015 260 / 378 Object file formats Executable code loading Dynamic libraries Position dependent code Some location change are more complicated with dynamic libraries: I .data can’t be referred with relative addressing from .text section if no relative data access is available. I .data section won’t have the same address in all process memory maps. I .text section is common to all processes and can’t reflect .data location differences. The solution is to use indirect memory access to reach data objects from code. One GOT (Global Offset Table) and PLT (Procedure Linkage Table) will hold data object addresses for each process. Some dynamic relocations and data copy will also help to solve this problem. Gabriel Laskar (EPITA) CAAL 2015 261 / 378 Object file formats Executable code loading Dynamic libraries Use of GOT/PLT GOT/PLT executable process in process A .text .text .rodata .rodata .data .data GOT/PLT .data call printf .text .text library heap stack Gabriel Laskar (EPITA) CAAL 2015 262 / 378 Assembly programming Part VII Assembly programming Gabriel Laskar (EPITA) CAAL 2015 263 / 378 Assembly programming Introduction 7: Assembly programming Introduction Pure assembly files Inline Assembly in C Gabriel Laskar (EPITA) CAAL 2015 264 / 378 Assembly programming Introduction What is assembly? Assembly is not I binary data I directly interpreted by the processor I don’t mix it up with machine code Assembly is I a language, with a syntax, keywords, etc. I translated in a binary format Gabriel Laskar (EPITA) CAAL 2015 265 / 378 Assembly programming Introduction Benefits of assembly programming I Improve developper’s understanding of computer architecture and programming I Direct access to hardware resources I Access to system features I Speed and efficiency of programs I Low footprint of programs Gabriel Laskar (EPITA) CAAL 2015 266 / 378 Assembly programming Introduction Drawbacks of assembly programming I Low-level programming forbids abstraction I Assembly code is hard to keep clean I Slows development down: lots of keywords, unusual behaviors I Each processor architecture has its own language: conventions, registers, instructions, privileges, allowed arguments Gabriel Laskar (EPITA) CAAL 2015 267 / 378 Assembly programming Pure assembly files 7: Assembly programming Introduction Pure assembly files Anatomy of assembly code Examples Inline Assembly in C Gabriel Laskar (EPITA) CAAL 2015 268 / 378 Assembly programming Pure assembly files Anatomy of assembly code 7: Assembly programming Introduction Pure assembly files Anatomy of assembly code Examples Inline Assembly in C Gabriel Laskar (EPITA) CAAL 2015 269 / 378 Assembly programming Pure assembly files Anatomy of assembly code Assembly file organization Source file is organized in different sections: code Contains program binary opcodes data Contains program global variables rodata Contains read-only program variables Gabriel Laskar (EPITA) CAAL 2015 270 / 378 Assembly programming Pure assembly files Anatomy of assembly code Assembly language statements directives Determine the assembler behavior instructions Chosen in the CPU instruction set, translated in binary format, written in output file operands Expressions containing register identifiers, immediate operands, symbols labels Between instructions, used to mark specific locations in source code Gabriel Laskar (EPITA) CAAL 2015 271 / 378 Assembly programming Pure assembly files Examples 7: Assembly programming Introduction Pure assembly files Anatomy of assembly code Examples Inline Assembly in C Gabriel Laskar (EPITA) CAAL 2015 272 / 378 Assembly programming Pure assembly files Examples Assembly language statements Example .mod_load asm-sparc .define FOO5 .section code .text .mod_asm opcodes v8 add %g0, FOO, %g1 .ends Gabriel Laskar (EPITA) CAAL 2015 273 / 378 Assembly programming Pure assembly files Examples Assembly language statements Example .mod_load asm-sparc .mod_load out-elf32 .section code .text .mod_asm opcodes v8 mov 0x12, %g1 mov 0x34533, %g2 add %g1, %g2, %g3 .ends Gabriel Laskar (EPITA) CAAL 2015 274 / 378 Assembly programming Pure assembly files Examples Assembly language statements Example .mod_load asm-sparc .include sparc/v8.def .section data .data lbl: .reserve4 .ends .section code .text .mod_asm opcodes v8 @set .data:lbl, %g2 ld[%g2],%g1 .ends Gabriel Laskar (EPITA) CAAL 2015 275 / 378 Assembly programming Pure assembly files Examples Assembly language statements Example .mod_load asm-sparc .include sparc/v8.def .section code .text .mod_asm opcodes v8 .export main .proc main ret restore .endp .ends Gabriel Laskar (EPITA) CAAL 2015 276 / 378 Assembly programming Pure assembly files Examples Assembly language statements Example .mod_load asm-sparc .include sparc/v8.def .section code .text .mod_asm opcodes v8 .extern exit .proc my_exit call exit nop .endp .ends Gabriel Laskar (EPITA) CAAL 2015 277 / 378 Assembly programming Inline Assembly in C 7: Assembly programming Introduction Pure assembly files Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz Gabriel Laskar (EPITA) CAAL 2015 278 / 378 Assembly programming Inline Assembly in C Presentation 7: Assembly programming Introduction Pure assembly files Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz Gabriel Laskar (EPITA) CAAL 2015 279 / 378 Assembly programming Inline Assembly in C Presentation Why? I C can’t express everything I Cpu-specific registers (flags, condition codes, hardware counters) I Features not addressed by language (atomic operations) I System features (Memory handling, interrupts handling, exception handling) I Compiler are not always aware of possible optimizations I Use of instruction side-effects I Un-optimizable patterns (or optimization patterns specific to only one algorithm) Gabriel Laskar (EPITA) CAAL 2015 280 / 378 Assembly programming Inline Assembly in C Presentation Inline Assembly Compiler’s PoV For the compiler, the assembly you pass in is: I a raw string I with a printf-like syntax I passed directly to the assembler I surounded by optional statements I input values I output values I clobbered values Gabriel Laskar (EPITA) CAAL 2015 281 / 378 Assembly programming Inline Assembly in C Presentation Inline Assembly Programmer’s PoV For you, the assembly you pass is meant to either I compute a value I change some hardware feature I have a side-effect Gabriel Laskar (EPITA) CAAL 2015 282 / 378 Assembly programming Inline Assembly in C Simple examples 7: Assembly programming Introduction Pure assembly files Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz Gabriel Laskar (EPITA) CAAL 2015 283 / 378 Assembly programming Inline Assembly in C Simple examples Example No data static inline void disable_interrupts(void) { asm volatile("cli"); } Gabriel Laskar (EPITA) CAAL 2015 284 / 378 Assembly programming Inline Assembly in C Simple examples Example Output only static inline cpu_cycle_t cpu_cycle_count(void) { uint32_t low, high; asm("rdtsc": "=a" (low), "=d" (high)); return (low|(( uint64_t)high << 32)); } Gabriel Laskar (EPITA) CAAL 2015 285 / 378 Assembly programming Inline Assembly in C Simple examples Example Input only static inline void cpu_io_write_8(uintptr_t addr, uint8_t data) { asm volatile("outb %0, %1" : : "a" (data), "d"(( uint16_t)addr) ); } Gabriel Laskar (EPITA) CAAL 2015 286 / 378 Assembly programming Inline Assembly in C Simple examples Example In-out static inline uint32_t cpu_endian_swap32(uint32_t x) { asm ("bswap %0" : "=r" (x) : "0" (x) ); return x; } Gabriel Laskar (EPITA) CAAL 2015 287 / 378 Assembly programming Inline Assembly in C Syntax 7: Assembly programming Introduction Pure assembly files Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz Gabriel Laskar (EPITA) CAAL 2015 288 / 378 Assembly programming Inline Assembly in C Syntax Syntax I one statement with optional arguments asm [volatile]("statements \n\t" "on many lines \n\t" [: [output_variables] [: [input_variables] [: clobbered_registers]]] ); I arguments are referenced in-order (%0..%n), whether they are input or output! I arguments types abide constraints, enclosed in"" Gabriel Laskar (EPITA) CAAL 2015 289 / 378 Assembly programming Inline Assembly in C Complex examples 7: Assembly programming Introduction Pure assembly files Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz Gabriel Laskar (EPITA) CAAL 2015 290 / 378 Assembly programming Inline Assembly in C Complex examples Add with carry and overflow C uint32_t add_cv(uint32_t a, uint32_t b, uint8_t cin, uint8_t *cout, uint8_t *vout) { uint64_t sum=( uint64_t)a+( uint64_t)b+( uint64_t)cin; *cout= sum >> 32; *vout= ( (b^(( uint32_t)sum))&~(a^b) ) >> 31; return sum; } Gabriel Laskar (EPITA) CAAL 2015 291 / 378 Assembly programming Inline Assembly in C Complex examples Add with carry and overflow ASM uint32_t add_cv(uint32_t a, uint32_t b, uint8_t cin, uint8_t *cout, uint8_t *vout) { uint32_t result; asm(" btl $0, %k5 \n" " adcl %k3, %k4 \n" " setc %b1 \n" " seto %b2 \n" : "=r" (result), "=qm"(*cout), "=qm"(*vout) : "r" (a), "0" (b), "r" (cin) ); return result; } Gabriel Laskar (EPITA) CAAL 2015 292 / 378 Assembly programming Inline Assembly in C Complex examples Atomic increment for ARM static inline bool_t cpu_atomic_inc(volatile atomic_int_t *a) { reg_t tmp=0, tmp2; asm volatile("1: \n\t" "ldrex %0, [%2] \n\t" "add %0, %0, #1 \n\t" "strex %1, %0, [%2] \n\t" "tst %1, #1 \n\t" "bne 1b \n\t" : "=&r" (tmp), "=&r" (tmp2) : "r" (a) : "m"(*a) ); Gabrielreturn Laskar (EPITA)tmp !=0; CAAL 2015 293 / 378 } Assembly programming Inline Assembly in C Named values 7: Assembly programming Introduction Pure assembly files Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz Gabriel Laskar (EPITA) CAAL 2015 294 / 378 Assembly programming Inline Assembly in C Named values Named values I Using numbers for values makes code harder to write and read I GCC permits named values in inline assembly Gabriel Laskar (EPITA) CAAL 2015 295 / 378 Assembly programming Inline Assembly in C Named values Named values With numbers static inline bool_t cpu_atomic_inc(volatile atomic_int_t *a) { reg_t tmp=0, tmp2; asm volatile("1: \n\t" "ldrex %0, [%2] \n\t" "add %0, %0, #1 \n\t" "strex %1, %0, [%2] \n\t" "tst %1, #1 \n\t" "bne 1b \n\t" : "=&r" (tmp), "=&r" (tmp2) : "r" (a) : "m"(*a) ); Gabriel Laskar (EPITA) CAAL 2015 296 / 378 return tmp !=0; } Assembly programming Inline Assembly in C Named values Named values Named static inline bool_t cpu_atomic_inc(volatile atomic_int_t *a) { reg_t tmp=0, tmp2; asm volatile("1: \n\t" "ldrex %[tmp], [%[atomic]] \n\t" "add %[tmp], %[tmp], #1 \n\t" "strex %[tmp2], %[tmp], [%[atomic]] \n\t" "tst %[tmp2], #1 \n\t" "bne 1b \n\t" : [tmp] "=&r" (tmp), [tmp2] "=&r" (tmp2) , [clobber] "=m"(*a) : [atomic] "r" (a) ); Gabriel Laskar (EPITA) CAAL 2015 297 / 378 return tmp !=0; } Assembly programming Inline Assembly in C Quizz 7: Assembly programming Introduction Pure assembly files Inline Assembly in C Presentation Simple examples Syntax Complex examples Named values Quizz Gabriel Laskar (EPITA) CAAL 2015 298 / 378 Assembly programming Inline Assembly in C Quizz Quizz! What does this do? asm volatile("":::"memory"); Gabriel Laskar (EPITA) CAAL 2015 299 / 378 Focus on x86 Part VIII Focus on x86 Gabriel Laskar (EPITA) CAAL 2015 300 / 378 Focus on x86 x86 story 8: Focus on x86 x86 story Timeline x86 architecture Instruction set Gabriel Laskar (EPITA) CAAL 2015 301 / 378 Focus on x86 x86 story Timeline 8: Focus on x86 x86 story Timeline x86 architecture Instruction set Gabriel Laskar (EPITA) CAAL 2015 302 / 378 Focus on x86 x86 story Timeline Early events and CPU development 1971 Intel 4004 1978 Intel 8086 & 8088, first x86 CPUs 1981 Intel 80186 1982 Intel 80286: 16 bits, 24 address bits, protection Gabriel Laskar (EPITA) CAAL 2015 303 / 378 Focus on x86 x86 story Timeline Successful CPU development 1985 Intel 80386: 32 bits, MMU 1989 Intel 80486: on-chip cache, pipeline 1993 Intel PentiumTM: superscalar, mmx, 64 address bits 1995 Intel Pentium Pro: ooo 1997 Intel Pentium II: internal L2 1999 Intel Pentium III: cpuid! 1999 AMD Athlon: ooo, 3dnow Gabriel Laskar (EPITA) CAAL 2015 304 / 378 Focus on x86 x86 story Timeline Technology barriers 2000 Intel Pentium 4: HT 2001 Intel Itanium 2003 AMD Athlon 64 2003 Intel Pentium M: P6 (PPro to PIII) 2006 Intel Pentium 4 Prescott 2M: 64 bits 2006 Intel Core/Core2: fusion 2009 Intel Core i5/i7: ... Gabriel Laskar (EPITA) CAAL 2015 305 / 378 Focus on x86 x86 story x86 architecture 8: Focus on x86 x86 story Timeline x86 architecture Instruction set Gabriel Laskar (EPITA) CAAL 2015 306 / 378 Focus on x86 x86 story x86 architecture x86 architecture The x86 architecture is: I based on a CISC instruction set. I based on little-endian memory access. I based on two operands instructions including memory access. I backwards compatible: all new x86 processors are fully compatible with their predecessors. I very complex: Pentium IV has a 20 stages pipeline. I newer processors have less pipeline stages Gabriel Laskar (EPITA) CAAL 2015 307 / 378 Focus on x86 x86 story x86 architecture Available registers First x86 generation had a few 16-bits registers available: I General purpose 16 bits registers: I ax, bx, cx, dx, si, di I General purpose 8 bits registers: I al, bl, cl, dl I ah, bh, ch, dh I Stack and frame registers: I sp, bp I Flag registers, ... Gabriel Laskar (EPITA) CAAL 2015 308 / 378 Focus on x86 x86 story x86 architecture Nested registers AX 16 bits register AH AL 8 bits registers Gabriel Laskar (EPITA) CAAL 2015 309 / 378 I For amd64, AMD extended the register count to 16, only available with 64 bits mode enabled I rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, I r8, r9, r10, r11, r12, r13, r14, r15 I AMD licensed the amd64 to Intel after the Itanium flop... Focus on x86 x86 story x86 architecture Register extensions I Starting with the 386 processor, all registers are now 32-bits wide. I Due to compatibility issue, 16 bits register are still present. I eax, ebx, ecx, edx, esi, edi, esp, ebp were added. I Even if registers size is increased, registers count remained the same: only 6 registers are available for operations. Gabriel Laskar (EPITA) CAAL 2015 310 / 378 I r8, r9, r10, r11, r12, r13, r14, r15 I AMD licensed the amd64 to Intel after the Itanium flop... Focus on x86 x86 story x86 architecture Register extensions I Starting with the 386 processor, all registers are now 32-bits wide. I Due to compatibility issue, 16 bits register are still present. I eax, ebx, ecx, edx, esi, edi, esp, ebp were added. I Even if registers size is increased, registers count remained the same: only 6 registers are available for operations. I For amd64, AMD extended the register count to 16, only available with 64 bits mode enabled I rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, Gabriel Laskar (EPITA) CAAL 2015 310 / 378 I AMD licensed the amd64 to Intel after the Itanium flop... Focus on x86 x86 story x86 architecture Register extensions I Starting with the 386 processor, all registers are now 32-bits wide. I Due to compatibility issue, 16 bits register are still present. I eax, ebx, ecx, edx, esi, edi, esp, ebp were added. I Even if registers size is increased, registers count remained the same: only 6 registers are available for operations. I For amd64, AMD extended the register count to 16, only available with 64 bits mode enabled I rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, I r8, r9, r10, r11, r12, r13, r14, r15 Gabriel Laskar (EPITA) CAAL 2015 310 / 378 Focus on x86 x86 story x86 architecture Register extensions I Starting with the 386 processor, all registers are now 32-bits wide. I Due to compatibility issue, 16 bits register are still present. I eax, ebx, ecx, edx, esi, edi, esp, ebp were added. I Even if registers size is increased, registers count remained the same: only 6 registers are available for operations. I For amd64, AMD extended the register count to 16, only available with 64 bits mode enabled I rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, I r8, r9, r10, r11, r12, r13, r14, r15 I AMD licensed the amd64 to Intel after the Itanium flop... Gabriel Laskar (EPITA) CAAL 2015 310 / 378 Focus on x86 x86 story x86 architecture 32/64 bits nested registers RAX 64 bits register EAX 32 bits register AX 16 bits register AH AL 8 bits registers Gabriel Laskar (EPITA) CAAL 2015 311 / 378 Focus on x86 x86 story x86 architecture Instruction format x86 architecture is a CISC based architecture: I All instructions have a variable length, I Instructions length can’t be determined before reading first bytes, I Many instructions with different formats are available. Gabriel Laskar (EPITA) CAAL 2015 312 / 378 Focus on x86 x86 story x86 architecture Instruction format 8086-80286 opcode opc2 mod immediate memory r/m displacement Gabriel Laskar (EPITA) CAAL 2015 313 / 378 Focus on x86 x86 story x86 architecture Instruction format 80386-Pentium data address size size prefix prefix opcode prefix opcode opc2 mod SIB immediate memory r/m displacement Gabriel Laskar (EPITA) CAAL 2015 313 / 378 Focus on x86 x86 story x86 architecture Instruction format 3dnow! data address size size prefix prefix opcode prefix opcode opc2 mod SIB immediate memory r/m displacement imm 3dnow Gabriel Laskar (EPITA) CAAL 2015 313 / 378 Focus on x86 x86 story x86 architecture Instruction format x86-64 x86−64 data address prefix size size prefix prefix opcode prefix opcode opc2 mod SIB immediate memory r/m displacement imm 3dnow Gabriel Laskar (EPITA) CAAL 2015 313 / 378 Focus on x86 x86 story x86 architecture Addressing mode The x86 architecture supports several complex addressing modes. On 32 bit processors (386 and later) a memory address can contain: I A base address register I A address displacement I An index register I An multiply factor on index register Example: mov eax, [ebx + 0x12345 + ecx * 8] Gabriel Laskar (EPITA) CAAL 2015 314 / 378 Focus on x86 x86 story x86 architecture Wired stack management Specific instructions are available for stack management: I push x instruction can be used to store data on the top of the stack and decrement the stack pointer. I pop x instruction removes data from the stack. I call and ret instructions directly push and pop return address on and from the stack. The (e)sp register is used as stack pointer. Gabriel Laskar (EPITA) CAAL 2015 315 / 378 Focus on x86 x86 story x86 architecture Branch I The x86 architecture uses a flag register to manage conditional branches. I Many different instructions with different lengthes can be used depending on jump size. I No delayed slot are used Gabriel Laskar (EPITA) CAAL 2015 316 / 378 Focus on x86 x86 story x86 architecture Jump size long jump short jump (−32768 to +32767) (−128 to +127) E9 EB opc immediate opc imm. 16/32 bits 8 bits Gabriel Laskar (EPITA) CAAL 2015 317 / 378 Focus on x86 x86 story Instruction set 8: Focus on x86 x86 story Timeline x86 architecture Instruction set Gabriel Laskar (EPITA) CAAL 2015 318 / 378 Focus on x86 x86 story Instruction set Instruction set x86 instruction set is huge: I Over 580 instructions handled by Pentium 3 CPU I Over 850 opcodes handled by Pentium 3 CPU A single instruction can be very complex: I Memory access capable I Handle different data widths I Perform complex computation Gabriel Laskar (EPITA) CAAL 2015 319 / 378 Focus on x86 x86 story Instruction set Instruction set extensions Extensions to the x86 instruction set are common and all instructions are not available on all CPUs. Starting with the pentium processor, SIMD-based instruction sets appeared: I MMX I 3dNow! I MMX2, SSE I 3dNow2!, SSE2, SSE3, SSE4s I To be continued... Gabriel Laskar (EPITA) CAAL 2015 320 / 378 Focus on x86 x86 story Instruction set Instruction set SIMD operations normal operation SIMD operation source 1 64 bits value 4 x 16 bits register source 2 64 bits value 4 x 16 bits register destination 64 bits value 4 x 16 bits register Gabriel Laskar (EPITA) CAAL 2015 321 / 378 Focus on x86 x86 story Instruction set x86-specific coding Due to its amazing number of available instructions, x86 code is highly tunable for optimizations: I High level languages compilers are not able to use all instructions efficiently, I Hand written assembly is often faster, I Complex Instructions can be used to process unexpected tasks. Gabriel Laskar (EPITA) CAAL 2015 322 / 378 Focus on x86 x86 story Instruction set Instruction set has-been parts I aaa, aad, aam, aas, daa, das I xlat Gabriel Laskar (EPITA) CAAL 2015 323 / 378 Focus on x86 x86 story Instruction set x86-specific coding example: LEA The LEA instruction is designed to compute memory addresses. But it can be used to add and multiply values. lea eax,[ebx+ ecx] lea eax,[ebx*8+ ebx] Gabriel Laskar (EPITA) CAAL 2015 324 / 378 Focus on x86 x86 story Instruction set x86-specific coding example: tiniest mem*()? memcpy rep movbs memset rep stosd memcmp repe cmpsb strncmp repnz cmpsb Gabriel Laskar (EPITA) CAAL 2015 325 / 378 Focus on RISC processors Part IX Focus on RISC processors Gabriel Laskar (EPITA) CAAL 2015 326 / 378 Focus on RISC processors History 9: Focus on RISC processors History Some instruction sets Gabriel Laskar (EPITA) CAAL 2015 327 / 378 Focus on RISC processors History Let’s see a timeline... 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 CDC 6600 Superscalar (Cray) 3DO DS GBA Data General Nova AcornARM projectARM 1 ARM 2 ARMARM ltd 6 (v3) ARM 7 ARM 9−tdmi ARM 11 M3 A8 M1 M0 (Edson de Castro) Cambridge Thumb A9 A9 >2GHz Hard / +licensingSoft core Queen award SUN DARPARISC Project RISC−I RISC−II SPARC SPARC−v8 SPARC−v9 Ultra−1 Ultra−2 Ultra−3 UC−Berkeley Open source Patterson PSP N64 SGI PlayStation Mips Mips−r2000 Mips−r3000 Mips−r4000 Mips−r8k Mips−r10k Mips−r12k PlayStation 2 Mips−r16k r24k DARPA Stanford 64−bit Mips IV Mips V Mips32 Soft−core Hennessy +licensing PS3 IBM RISC CPU Cheetah America Bellatrix STI project CELL Unsure A−I−M RS−6000 RSC PPC POWER 3 POWER 4 POWER 5 DARPAPOWER 7 RIOS 1/9 64 bits Hard−core GameCube Wii XBox 360 DEC PRISM Alpha Alpha Alpha Alpha EV4 EV5 EV6 EV7 21064 21164 21264 21364 Unsure Saturn HitachiSuper H SH−2 SH−3 SH−4 SH−6 SH−7 Dreamcast (announced) licensing issues HP TS1 PCX−S PCX−L PCX−U PCX−W+ Shortfin PA−Risc PA−7000 PA−7100−LC PA−8000 PA−8600 PA−8900 Unsure (SIMD) Gabriel Laskar (EPITA) CAAL 2015 328 / 378 Focus on RISC processors Some instruction sets 9: Focus on RISC processors History Some instruction sets Mips SPARC PPC ARM Gabriel Laskar (EPITA) CAAL 2015 329 / 378 Focus on RISC processors Some instruction sets Mips 9: Focus on RISC processors History Some instruction sets Mips SPARC PPC ARM Gabriel Laskar (EPITA) CAAL 2015 330 / 378 Focus on RISC processors Some instruction sets Mips Mips Instruction format 012345678910111213141516171819202122232425262728293031 J/JAL offset OP rs rt imm Special rs rt rd ... op I 32 GPR I hard-wired r0 to 0 I 32 FPU registers, all sizes aliased I delay slot Gabriel Laskar (EPITA) CAAL 2015 331 / 378 Focus on RISC processors Some instruction sets Mips Mips Asm code addiu r2, r3, 0x42 ; alu reg / imm or r3, r4, r5 ; alu reg / reg lui r2, 0x1234 ; load upper immediate ori r2, r2, 0x5678 ; r2 = 0x12345678 lw r4, 0x1234(r9) ; load word from r9+ 0x1234 slt r1, r2, r3 ; test if r2 < r3 bnez r1,2f ; jump if true nop ; delay slot lbu r2,(r3) ; load byte, not sign extended jal foo ; pc = foo; r31 = return address Gabrielnop Laskar (EPITA) CAAL 2015 332 / 378 Focus on RISC processors Some instruction sets SPARC 9: Focus on RISC processors History Some instruction sets Mips SPARC PPC ARM Gabriel Laskar (EPITA) CAAL 2015 333 / 378 Focus on RISC processors Some instruction sets SPARC SPARC Instruction format 012345678910111213141516171819202122232425262728293031 01 offset 012345678910111213141516171819202122232425262728293031 00 rd op2 imm 00 a cond op2 imm 012345678910111213141516171819202122232425262728293031 1x rd op3 rs1 0 asi rs2 1x rd op3 rs1 1 imm 1x rd op3 rs1 opf rs2 I 32 GPR I hard-wired %g0 to 0 I register window I unwindowed aliased FPU registers, 8 * 128 bits, 32 * 32 bits I Gabrieldelay Laskar slot (EPITA) CAAL 2015 334 / 378 Focus on RISC processors Some instruction sets SPARC SPARC Asm code add %g3, 0x42, %g2 ; alu reg / imm or %g3, %g4, %g5 ; alu reg / reg sethi 0x12345, %g2 ; load upper immediate(20 bits) or %g2, 0x678, %g2 ; g2 = 0x12345678 ld[%l2+ 0x1234], %g4 ; load word from l2+ 0x1234 ld[%l2+ %i3], %g4 ; load word from l2+ i3 tst %l2, %l3 ; test if l2 < l3 blt3f ; jump if true Gabriel Laskar (EPITA) CAAL 2015 335 / 378 Focus on RISC processors Some instruction sets PPC 9: Focus on RISC processors History Some instruction sets Mips SPARC PPC ARM Gabriel Laskar (EPITA) CAAL 2015 336 / 378 Focus on RISC processors Some instruction sets PPC PowerPC Instruction format 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 op jump offset aa lk op rd ra imm op rd ra rb mb me rc op rd ra rb op2 rc I 32 GPR I 32 FPR, fixed internal representation I additional registers for lr, ctr I no global flags, but 8 “cause registers” crx I can merge causes with logical operations Gabriel Laskar (EPITA) CAAL 2015 337 / 378 Focus on RISC processors Some instruction sets PPC PowerPC Asm code addi3,2, 42 ; alu reg / imm or.3,4,5 ; alu reg / reg, update cr0 lis3, 0x1234 ; load upper immediate(16 bits) ori3,3, 0x5678 ; r3 = 0x12345678 lwz4,9, 0x1234 ; load word from r9+ 0x1234 lwzx4,9,3 ; load word from r9+ r3 mtctr4 ; put r4 in count register ... bdnzl2f ; Decrement CTR, Branch if CTR! = 0 rlwinm ... ; Rotate and mask Gabriel Laskar (EPITA) CAAL 2015 338 / 378 Focus on RISC processors Some instruction sets ARM 9: Focus on RISC processors History Some instruction sets Mips SPARC PPC ARM Gabriel Laskar (EPITA) CAAL 2015 339 / 378 You can prefix any instruction with “never”: I 1/16th of the instruction set is “nop”... Focus on RISC processors Some instruction sets ARM ARM Instruction features I 16 GPR, one of them is pc (r15) I Every instruction is “guarded” by a condition. It may be: I always, >, >=, “higher”, overflow, negative, carry, == I their opposites I aliased FPR: 16 double, 32 float Gabriel Laskar (EPITA) CAAL 2015 340 / 378 Focus on RISC processors Some instruction sets ARM ARM Instruction features I 16 GPR, one of them is pc (r15) I Every instruction is “guarded” by a condition. It may be: I always, >, >=, “higher”, overflow, negative, carry, == I their opposites I aliased FPR: 16 double, 32 float You can prefix any instruction with “never”: I 1/16th of the instruction set is “nop”... Gabriel Laskar (EPITA) CAAL 2015 340 / 378 Focus on RISC processors Some instruction sets ARM ARM Instruction format 012345678910111213141516171819202122232425262728293031 cond 00 i op s rn rd imm/reg cond 01 i p u bw l rn rd imm/reg cond 100 p u bw l rn reg-list cond 101 l jmp offset cond 11 coproc, sys Gabriel Laskar (EPITA) CAAL 2015 341 / 378 and r3, r7, #42 ; r3 = r7 & 0x2a add r2, r8, r5, lsl r1 ; r2 = r8+(r5 << r1) sub r1, r9, r1, lsl #1 ; r1 = r9 - (r1 << 1) 012345678910111213141516171819202122232425262728293031 1110 00 1 and 0 r7 r3 « 0 0x2a 1110 00 0 add 0 r8 r2 r1 0 lsl 1 r5 1110 00 0 sub 0 r9 r1 0x1 lsl 0 r1 Focus on RISC processors Some instruction sets ARM ARM Instruction format (2) 012345678910111213141516171819202122232425262728293031 cond 00 i op s rn rd imm/reg cond 01 i p u bw l rn rd imm/reg 1 rotate imm 0 rs 0 ty 1 rm 0 amount ty 0 rm Gabriel Laskar (EPITA) CAAL 2015 342 / 378 Focus on RISC processors Some instruction sets ARM ARM Instruction format (2) 012345678910111213141516171819202122232425262728293031 cond 00 i op s rn rd imm/reg cond 01 i p u bw l rn rd imm/reg 1 rotate imm 0 rs 0 ty 1 rm 0 amount ty 0 rm and r3, r7, #42 ; r3 = r7 & 0x2a add r2, r8, r5, lsl r1 ; r2 = r8+(r5 << r1) sub r1, r9, r1, lsl #1 ; r1 = r9 - (r1 << 1) 012345678910111213141516171819202122232425262728293031 1110 00 1 and 0 r7 r3 « 0 0x2a 1110 00 0 add 0 r8 r2 r1 0 lsl 1 r5 1110Gabriel00 Laskar0 (EPITA)sub 0 r9 r1 CAAL0x1 lsl 0 r1 2015 342 / 378 Focus on RISC processors Some instruction sets ARM ARM Asm code add r3, r2, #0x42 ; alu reg / imm orr r3, r4, r5 ; alu reg / reg bic r3, r4, r5, lsr #4 ; alu reg / reg & shift cmp r2, #0 ; test if r2 < 0 rsblt r2, r2, #0 ; then r2 = 0 - r2 ldr r2, #0x12345678 ; rewritten as ldr r2,[pc, #offset] ; pc-relative load streq r4,[r2, r3, lsl #4] ; store word to r2 + r3 << 4 ; only if last cmp is ‘eq’ str r4,[r2, #8]! ; store word to r2 + 8 ; and r2 = r2+8 offset: ; 32-bit immediates zone Gabriel0x12345678 Laskar (EPITA) CAAL 2015 343 / 378 Focus on RISC processors Some instruction sets ARM ARM Block transfers 012345678910111213141516171819202122232425262728293031 cond 100 p u bw l rn reg-list stmdb sp!, {r2, r3, r4, r8} push {r2, r3, r4, r8} 012345678910111213141516171819202122232425262728293031 cond 100 1 0 0 1 0 r13 0000000100011100 Gabriel Laskar (EPITA) CAAL 2015 344 / 378 Focus on RISC processors Some instruction sets ARM ARM Block transfers hacks (1) The link register is r14, then the following code saves r14 and restores it to PC afterwards... push {r2, r3, r4, r8, lr} ... pop {r2, r3, r4, r8, pc} ..00 ..04 sp → r2 ..08 r3 ..0c r4 ..10 r8 ..14 lr ..18 old sp → xxx ..1c ..20 Gabriel Laskar (EPITA) CAAL 2015 345 / 378 Focus on RISC processors Some instruction sets ARM ARM Block transfers hacks (2) Do a memcpy with some free registers, for instance, copy 16 bytes from *r0 to *r1, not using r2 nor r5, update r0 and r1 to point on next block... ldmia r0!, {r3, r4, r6, r7} stmia r1!, {r3, r4, r6, r7} Gabriel Laskar (EPITA) CAAL 2015 346 / 378 Focus on RISC processors Some instruction sets ARM ARM Instruction-space exhaustion Eventually, ARM instruction-set continued to grow despite the obvious exhaustion of the “clean” instruction-set space. I abuse of concatenation of seldom bits around the instruction word I sometimes forbid r15 as an operand, and reuse other fields I reuse the “never” condition code Gabriel Laskar (EPITA) CAAL 2015 347 / 378 cond 00 0 000 a s rn rd rs 1001 rm Focus on RISC processors Some instruction sets ARM ARM Instruction-space exhaustion Illustration 012345678910111213141516171819202122232425262728293031 cond 00 I op s rn rd op2 1 rotate imm 0 rs 0 ty 1 rm 0 amount ty 0 rm How to add the multiplication? Gabriel Laskar (EPITA) CAAL 2015 348 / 378 Focus on RISC processors Some instruction sets ARM ARM Instruction-space exhaustion Illustration 012345678910111213141516171819202122232425262728293031 cond 00 I op s rn rd op2 1 rotate imm 0 rs 0 ty 1 rm 0 amount ty 0 rm How to add the multiplication? cond 00 0 000 a s rn rd rs 1001 rm Gabriel Laskar (EPITA) CAAL 2015 348 / 378 Focus on RISC processors Some instruction sets ARM Thumb mode One goal: Code compression. Concessions to pack a maximum of operations in a minimal space: I Access to only 8 registers among 16 I Implicit stack ops I Implicit pc-relative operations I No predicates any more Dirty hacks for making it work: I Use lower bit of r15 as a mode indicator I Use a new bx/blx instruction Gabriel Laskar (EPITA) CAAL 2015 349 / 378 I 16 bit instruction words are not enough any more Never mind I Keep the basic 16-bit thumb encoding I When we need more, just make the instruction 32-bits long... I Decode mixed 16/32-bits instructions I Only 16-bit alignment constraint for 32-bit long thumb-2 ops Then Thumb-2 is a RISC with a CISC encoding! Focus on RISC processors Some instruction sets ARM Thumb2 Keep up with thumb, this is great. Ideas: I Have an asm-code compatibility with ARM I Remap the whole ARM 32-bit instruction set in 16-bit thumb Gabriel Laskar (EPITA) CAAL 2015 350 / 378 Never mind I Keep the basic 16-bit thumb encoding I When we need more, just make the instruction 32-bits long... I Decode mixed 16/32-bits instructions I Only 16-bit alignment constraint for 32-bit long thumb-2 ops Then Thumb-2 is a RISC with a CISC encoding! Focus on RISC processors Some instruction sets ARM Thumb2 Keep up with thumb, this is great. Ideas: I Have an asm-code compatibility with ARM I Remap the whole ARM 32-bit instruction set in 16-bit thumb I 16 bit instruction words are not enough any more Gabriel Laskar (EPITA) CAAL 2015 350 / 378 I Decode mixed 16/32-bits instructions I Only 16-bit alignment constraint for 32-bit long thumb-2 ops Then Thumb-2 is a RISC with a CISC encoding! Focus on RISC processors Some instruction sets ARM Thumb2 Keep up with thumb, this is great. Ideas: I Have an asm-code compatibility with ARM I Remap the whole ARM 32-bit instruction set in 16-bit thumb I 16 bit instruction words are not enough any more Never mind I Keep the basic 16-bit thumb encoding I When we need more, just make the instruction 32-bits long... Gabriel Laskar (EPITA) CAAL 2015 350 / 378 I Only 16-bit alignment constraint for 32-bit long thumb-2 ops Then Thumb-2 is a RISC with a CISC encoding! Focus on RISC processors Some instruction sets ARM Thumb2 Keep up with thumb, this is great. Ideas: I Have an asm-code compatibility with ARM I Remap the whole ARM 32-bit instruction set in 16-bit thumb I 16 bit instruction words are not enough any more Never mind I Keep the basic 16-bit thumb encoding I When we need more, just make the instruction 32-bits long... I Decode mixed 16/32-bits instructions Gabriel Laskar (EPITA) CAAL 2015 350 / 378 Then Thumb-2 is a RISC with a CISC encoding! Focus on RISC processors Some instruction sets ARM Thumb2 Keep up with thumb, this is great. Ideas: I Have an asm-code compatibility with ARM I Remap the whole ARM 32-bit instruction set in 16-bit thumb I 16 bit instruction words are not enough any more Never mind I Keep the basic 16-bit thumb encoding I When we need more, just make the instruction 32-bits long... I Decode mixed 16/32-bits instructions I Only 16-bit alignment constraint for 32-bit long thumb-2 ops Gabriel Laskar (EPITA) CAAL 2015 350 / 378 Focus on RISC processors Some instruction sets ARM Thumb2 Keep up with thumb, this is great. Ideas: I Have an asm-code compatibility with ARM I Remap the whole ARM 32-bit instruction set in 16-bit thumb I 16 bit instruction words are not enough any more Never mind I Keep the basic 16-bit thumb encoding I When we need more, just make the instruction 32-bits long... I Decode mixed 16/32-bits instructions I Only 16-bit alignment constraint for 32-bit long thumb-2 ops Then Thumb-2 is a RISC with a CISC encoding! Gabriel Laskar (EPITA) CAAL 2015 350 / 378 CPU-aware optimizations Part X CPU-aware optimizations Gabriel Laskar (EPITA) CAAL 2015 351 / 378 CPU-aware optimizations Rationale 10: CPU-aware optimizations Rationale Examples Code readability guidelines Gabriel Laskar (EPITA) CAAL 2015 352 / 378 CPU-aware optimizations Rationale Purpose Knowing the low-level implementation of the processors, we can: I write code easier for the compiler to translate I write code easier for the CPU to run I because nicer with the pipeline I because nicer with the memory subsystem Gabriel Laskar (EPITA) CAAL 2015 353 / 378 CPU-aware optimizations Examples 10: CPU-aware optimizations Rationale Examples abs() Sign extension Power of two detection Mask merging Code readability guidelines Gabriel Laskar (EPITA) CAAL 2015 354 / 378 CPU-aware optimizations Examples abs() 10: CPU-aware optimizations Rationale Examples abs() Sign extension Power of two detection Mask merging Code readability guidelines Gabriel Laskar (EPITA) CAAL 2015 355 / 378 CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() The absolute value is a good example of optimized code which is easy to implement in an efficient and smart way. Gabriel Laskar (EPITA) CAAL 2015 356 / 378 CPU-aware optimizations Examples abs() Nicer with the pipeline abs() basic code int abs(int x) { if (x<0) return -x; else return x; } int abs(int x) { return (x<0)?-x : x; } Gabriel Laskar (EPITA) CAAL 2015 357 / 378 I Addition: a + 1 = a - (-1) I Number representation: -1 = 0xffffffff I Negation: -a = (not a) + 1 I not a = a xor 0xffffffff = a xor -1 I if (x < 0), return I if (x > 0), return I Sign extension: (x » 31) CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() Let’s go back to some mathematical principles: Gabriel Laskar (EPITA) CAAL 2015 358 / 378 I Number representation: -1 = 0xffffffff I Negation: -a = (not a) + 1 I not a = a xor 0xffffffff = a xor -1 I if (x < 0), return I if (x > 0), return I Sign extension: (x » 31) CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() Let’s go back to some mathematical principles: I Addition: a + 1 = a - (-1) Gabriel Laskar (EPITA) CAAL 2015 358 / 378 I Negation: -a = (not a) + 1 I not a = a xor 0xffffffff = a xor -1 I if (x < 0), return I if (x > 0), return I Sign extension: (x » 31) CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() Let’s go back to some mathematical principles: I Addition: a + 1 = a - (-1) I Number representation: -1 = 0xffffffff Gabriel Laskar (EPITA) CAAL 2015 358 / 378 I not a = a xor 0xffffffff = a xor -1 I if (x < 0), return I if (x > 0), return I Sign extension: (x » 31) CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() Let’s go back to some mathematical principles: I Addition: a + 1 = a - (-1) I Number representation: -1 = 0xffffffff I Negation: -a = (not a) + 1 Gabriel Laskar (EPITA) CAAL 2015 358 / 378 I if (x < 0), return I if (x > 0), return I Sign extension: (x » 31) CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() Let’s go back to some mathematical principles: I Addition: a + 1 = a - (-1) I Number representation: -1 = 0xffffffff I Negation: -a = (not a) + 1 I not a = a xor 0xffffffff = a xor -1 Gabriel Laskar (EPITA) CAAL 2015 358 / 378 I Sign extension: (x » 31) CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() Let’s go back to some mathematical principles: I Addition: a + 1 = a - (-1) I Number representation: -1 = 0xffffffff I Negation: -a = (not a) + 1 I not a = a xor 0xffffffff = a xor -1 I if (x < 0), return -x I if (x > 0), return x Gabriel Laskar (EPITA) CAAL 2015 358 / 378 I Sign extension: (x » 31) CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() Let’s go back to some mathematical principles: I Addition: a + 1 = a - (-1) I Number representation: -1 = 0xffffffff I Negation: -a = (not a) + 1 I not a = a xor 0xffffffff = a xor -1 I if (x < 0), return ((not x) + 1) I if (x > 0), return x Gabriel Laskar (EPITA) CAAL 2015 358 / 378 I Sign extension: (x » 31) CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() Let’s go back to some mathematical principles: I Addition: a + 1 = a - (-1) I Number representation: -1 = 0xffffffff I Negation: -a = (not a) + 1 I not a = a xor 0xffffffff = a xor -1 I if (x < 0), return ((x ^ 0xffffffff) - 0xffffffff) I if (x > 0), return ((x ^ 0) + 0) Gabriel Laskar (EPITA) CAAL 2015 358 / 378 I Sign extension: (x » 31) CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() Let’s go back to some mathematical principles: I Addition: a + 1 = a - (-1) I Number representation: -1 = 0xffffffff I Negation: -a = (not a) + 1 I not a = a xor 0xffffffff = a xor -1 I if (x < 0), return ((x ^ 0xffffffff) - 0xffffffff) I if (x > 0), return ((x ^ 0) + 0) How to produce -1 when x < 0? Gabriel Laskar (EPITA) CAAL 2015 358 / 378 CPU-aware optimizations Examples abs() Nicer with the pipeline example: abs() Let’s go back to some mathematical principles: I Addition: a + 1 = a - (-1) I Number representation: -1 = 0xffffffff I Negation: -a = (not a) + 1 I not a = a xor 0xffffffff = a xor -1 I if (x < 0), return ((x ^ 0xffffffff) - 0xffffffff) I if (x > 0), return ((x ^ 0) + 0) How to produce -1 when x < 0? I Sign extension: (x » 31) Gabriel Laskar (EPITA) CAAL 2015 358 / 378 int abs(int x) { int sign_word=-(1& (x >> 31)); return (x^ sign_word)- sign_word; } CPU-aware optimizations Examples abs() Nicer with the pipeline abs() better code int abs(int x) { int sign_word=x >> 31; return (x^ sign_word)- sign_word; } Gabriel Laskar (EPITA) CAAL 2015 359 / 378 CPU-aware optimizations Examples abs() Nicer with the pipeline abs() better code int abs(int x) { int sign_word=x >> 31; return (x^ sign_word)- sign_word; } int abs(int x) { int sign_word=-(1& (x >> 31)); return (x^ sign_word)- sign_word; } Gabriel Laskar (EPITA) CAAL 2015 359 / 378 CPU-aware optimizations Examples Sign extension 10: CPU-aware optimizations Rationale Examples abs() Sign extension Power of two detection Mask merging Code readability guidelines Gabriel Laskar (EPITA) CAAL 2015 360 / 378 -high 11111111111111111111000000000000 result ssssssssssssssssssssnnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val& high)? (val|(-high)): val; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size Sometimes, we need to sign extend a value from an arbitrary word width (say 13 bits) to a CPU word (say 32 bits). How to do it? x 0000000000000000000snnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 361 / 378 result ssssssssssssssssssssnnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val& high)? (val|(-high)): val; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size Sometimes, we need to sign extend a value from an arbitrary word width (say 13 bits) to a CPU word (say 32 bits). How to do it? x 0000000000000000000snnnnnnnnnnnn -high 11111111111111111111000000000000 Gabriel Laskar (EPITA) CAAL 2015 361 / 378 int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val& high)? (val|(-high)): val; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size Sometimes, we need to sign extend a value from an arbitrary word width (say 13 bits) to a CPU word (say 32 bits). How to do it? x 0000000000000000000snnnnnnnnnnnn -high 11111111111111111111000000000000 result ssssssssssssssssssssnnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 361 / 378 CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size Sometimes, we need to sign extend a value from an arbitrary word width (say 13 bits) to a CPU word (say 32 bits). How to do it? x 0000000000000000000snnnnnnnnnnnn -high 11111111111111111111000000000000 result ssssssssssssssssssssnnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val& high)? (val|(-high)): val; } Gabriel Laskar (EPITA) CAAL 2015 361 / 378 « snnnnnnnnnnnn0000000000000000000 » ssssssssssssssssssssnnnnnnnnnnnn int sign_ext(int val, int bits) { int shift_bits= 32- bits; return (val << shift_bits) >> shift_bits; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 362 / 378 » ssssssssssssssssssssnnnnnnnnnnnn int sign_ext(int val, int bits) { int shift_bits= 32- bits; return (val << shift_bits) >> shift_bits; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn « snnnnnnnnnnnn0000000000000000000 Gabriel Laskar (EPITA) CAAL 2015 362 / 378 int sign_ext(int val, int bits) { int shift_bits= 32- bits; return (val << shift_bits) >> shift_bits; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn « snnnnnnnnnnnn0000000000000000000 » ssssssssssssssssssssnnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 362 / 378 CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn « snnnnnnnnnnnn0000000000000000000 » ssssssssssssssssssssnnnnnnnnnnnn int sign_ext(int val, int bits) { int shift_bits= 32- bits; return (val << shift_bits) >> shift_bits; } Gabriel Laskar (EPITA) CAAL 2015 362 / 378 sb 0000000000000000000s000000000000 xor 0000000000000000000Snnnnnnnnnnnn x 00000000000000000001nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000000nnnnnnnnnnnn .. - sb 11111111111111111111nnnnnnnnnnnn x 00000000000000000000nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000001nnnnnnnnnnnn .. - sb 00000000000000000000nnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val^ high)- high; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 363 / 378 xor 0000000000000000000Snnnnnnnnnnnn x 00000000000000000001nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000000nnnnnnnnnnnn .. - sb 11111111111111111111nnnnnnnnnnnn x 00000000000000000000nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000001nnnnnnnnnnnn .. - sb 00000000000000000000nnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val^ high)- high; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn sb 0000000000000000000s000000000000 Gabriel Laskar (EPITA) CAAL 2015 363 / 378 x 00000000000000000001nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000000nnnnnnnnnnnn .. - sb 11111111111111111111nnnnnnnnnnnn x 00000000000000000000nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000001nnnnnnnnnnnn .. - sb 00000000000000000000nnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val^ high)- high; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn sb 0000000000000000000s000000000000 xor 0000000000000000000Snnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 363 / 378 x 00000000000000000001nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000000nnnnnnnnnnnn .. - sb 11111111111111111111nnnnnnnnnnnn x 00000000000000000000nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000001nnnnnnnnnnnn .. - sb 00000000000000000000nnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val^ high)- high; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn sb 0000000000000000000s000000000000 xor 0000000000000000000Snnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 363 / 378 sb 00000000000000000001000000000000 x ^ sb 00000000000000000000nnnnnnnnnnnn .. - sb 11111111111111111111nnnnnnnnnnnn x 00000000000000000000nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000001nnnnnnnnnnnn .. - sb 00000000000000000000nnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val^ high)- high; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn sb 0000000000000000000s000000000000 xor 0000000000000000000Snnnnnnnnnnnn x 00000000000000000001nnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 363 / 378 x 00000000000000000000nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000001nnnnnnnnnnnn .. - sb 00000000000000000000nnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val^ high)- high; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn sb 0000000000000000000s000000000000 xor 0000000000000000000Snnnnnnnnnnnn x 00000000000000000001nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000000nnnnnnnnnnnn .. - sb 11111111111111111111nnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 363 / 378 sb 00000000000000000001000000000000 x ^ sb 00000000000000000001nnnnnnnnnnnn .. - sb 00000000000000000000nnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val^ high)- high; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn sb 0000000000000000000s000000000000 xor 0000000000000000000Snnnnnnnnnnnn x 00000000000000000001nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000000nnnnnnnnnnnn .. - sb 11111111111111111111nnnnnnnnnnnn x 00000000000000000000nnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 363 / 378 int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val^ high)- high; } CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn sb 0000000000000000000s000000000000 xor 0000000000000000000Snnnnnnnnnnnn x 00000000000000000001nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000000nnnnnnnnnnnn .. - sb 11111111111111111111nnnnnnnnnnnn x 00000000000000000000nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000001nnnnnnnnnnnn .. - sb 00000000000000000000nnnnnnnnnnnn Gabriel Laskar (EPITA) CAAL 2015 363 / 378 CPU-aware optimizations Examples Sign extension Sign extension at an arbitrary size x 0000000000000000000snnnnnnnnnnnn sb 0000000000000000000s000000000000 xor 0000000000000000000Snnnnnnnnnnnn x 00000000000000000001nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000000nnnnnnnnnnnn .. - sb 11111111111111111111nnnnnnnnnnnn x 00000000000000000000nnnnnnnnnnnn sb 00000000000000000001000000000000 x ^ sb 00000000000000000001nnnnnnnnnnnn .. - sb 00000000000000000000nnnnnnnnnnnn int sign_ext(int val, int bits) { int high=1 << (bits-1); return (val^ high)- high; } Gabriel Laskar (EPITA) CAAL 2015 363 / 378 CPU-aware optimizations Examples Power of two detection 10: CPU-aware optimizations Rationale Examples abs() Sign extension Power of two detection Mask merging Code readability guidelines Gabriel Laskar (EPITA) CAAL 2015 364 / 378 I Lower bits up to the lowest 1 get flipped I Other bits stay the same int is_pow2(int n) { return !(n& (n-1)); } CPU-aware optimizations Examples Power of two detection Power of two Properties I Powers of two have only one “1” bit I Substracting 1 from a power of two flips the only “1” Examples: I 1110010 - 1 = 1110001 I 0100000-1=0011111 Gabriel Laskar (EPITA) CAAL 2015 365 / 378 int is_pow2(int n) { return !(n& (n-1)); } CPU-aware optimizations Examples Power of two detection Power of two Properties I Powers of two have only one “1” bit I Substracting 1 from a power of two flips the only “1” Examples: I 1110010 - 1 = 1110001 I 0100000-1=0011111 I Lower bits up to the lowest 1 get flipped I Other bits stay the same Gabriel Laskar (EPITA) CAAL 2015 365 / 378 CPU-aware optimizations Examples Power of two detection Power of two Properties I Powers of two have only one “1” bit I Substracting 1 from a power of two flips the only “1” Examples: I 1110010 - 1 = 1110001 I 0100000-1=0011111 I Lower bits up to the lowest 1 get flipped I Other bits stay the same int is_pow2(int n) { return !(n& (n-1)); } Gabriel Laskar (EPITA) CAAL 2015 365 / 378 CPU-aware optimizations Examples Mask merging 10: CPU-aware optimizations Rationale Examples abs() Sign extension Power of two detection Mask merging Code readability guidelines Gabriel Laskar (EPITA) CAAL 2015 366 / 378 int mask_merge(int mask, int where_0, int where_1) { return (mask& where_1)|(~mask& where_0); } CPU-aware optimizations Examples Mask merging Mask merging 01234567 where_0 1 1 1 0 1 0 0 0 where_1 1 0 0 0 1 1 1 0 mask 0 0 1 1 1 1 0 0 result 1 1 0 0 1 1 0 0 Gabriel Laskar (EPITA) CAAL 2015 367 / 378 CPU-aware optimizations Examples Mask merging Mask merging 01234567 where_0 1 1 1 0 1 0 0 0 where_1 1 0 0 0 1 1 1 0 mask 0 0 1 1 1 1 0 0 result 1 1 0 0 1 1 0 0 int mask_merge(int mask, int where_0, int where_1) { return (mask& where_1)|(~mask& where_0); } Gabriel Laskar (EPITA) CAAL 2015 367 / 378 mask 0 0 1 1 1 1 0 0 (w0 ^ w1) &mask 0 0 1 0 0 1 0 0 ((w0 ^ w1) &mask) ^ w0 1 1 0 0 1 1 0 0 int mask_merge(int mask, int where_0, int where_1) { return ((where_0^ where_1)& mask)^ where_0; } CPU-aware optimizations Examples Mask merging Mask merging Less instructions 01234567 where_0 1 1 1 0 1 0 0 0 where_1 1 0 0 0 1 1 1 0 where_0 ^ where_1 0 1 1 0 0 1 1 0 Gabriel Laskar (EPITA) CAAL 2015 368 / 378 ((w0 ^ w1) &mask) ^ w0 1 1 0 0 1 1 0 0 int mask_merge(int mask, int where_0, int where_1) { return ((where_0^ where_1)& mask)^ where_0; } CPU-aware optimizations Examples Mask merging Mask merging Less instructions 01234567 where_0 1 1 1 0 1 0 0 0 where_1 1 0 0 0 1 1 1 0 where_0 ^ where_1 0 1 1 0 0 1 1 0 mask 0 0 1 1 1 1 0 0 (w0 ^ w1) &mask 0 0 1 0 0 1 0 0 Gabriel Laskar (EPITA) CAAL 2015 368 / 378 int mask_merge(int mask, int where_0, int where_1) { return ((where_0^ where_1)& mask)^ where_0; } CPU-aware optimizations Examples Mask merging Mask merging Less instructions 01234567 where_0 1 1 1 0 1 0 0 0 where_1 1 0 0 0 1 1 1 0 where_0 ^ where_1 0 1 1 0 0 1 1 0 mask 0 0 1 1 1 1 0 0 (w0 ^ w1) &mask 0 0 1 0 0 1 0 0 ((w0 ^ w1) &mask) ^ w0 1 1 0 0 1 1 0 0 Gabriel Laskar (EPITA) CAAL 2015 368 / 378 CPU-aware optimizations Examples Mask merging Mask merging Less instructions 01234567 where_0 1 1 1 0 1 0 0 0 where_1 1 0 0 0 1 1 1 0 where_0 ^ where_1 0 1 1 0 0 1 1 0 mask 0 0 1 1 1 1 0 0 (w0 ^ w1) &mask 0 0 1 0 0 1 0 0 ((w0 ^ w1) &mask) ^ w0 1 1 0 0 1 1 0 0 int mask_merge(int mask, int where_0, int where_1) { return ((where_0^ where_1)& mask)^ where_0; } Gabriel Laskar (EPITA) CAAL 2015 368 / 378 CPU-aware optimizations Code readability guidelines 10: CPU-aware optimizations Rationale Examples Code readability guidelines Gabriel Laskar (EPITA) CAAL 2015 369 / 378 CPU-aware optimizations Code readability guidelines Code readability If you ever code with this kind of hack I always create functions with explicit name and prototype I eventually document the intended behavior I may use a static inline int abs(int x); int sign_ext(int val, int bits); int is_pow2(int n); int mask_merge(int mask, int where_0, int where_1); Gabriel Laskar (EPITA) CAAL 2015 370 / 378 Multi-/Many-core, heterogeneous systems Part XI Multi-/Many-core, heterogeneous systems Gabriel Laskar (EPITA) CAAL 2015 371 / 378 Multi-/Many-core, heterogeneous systems Topology evolution 11: Multi-/Many-core, heterogeneous systems Topology evolution Challenges Gabriel Laskar (EPITA) CAAL 2015 372 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies CPU RAM Video Keyboard Floppy Sound Serial Network Hard Disk Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies CPU RAM Bridge Bridge Network Video Keyboard Sound Hard Disk Floppy Serial Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies CPU CPU RAM Bridge Bridge Network Video Keyboard Sound Hard Disk Floppy Serial Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies CPU CPU RAM Bridge Video Bridge Bridge Keyboard Network Sound Hard Disk Floppy USB Serial Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies CPU CPU RAM Bridge Video Bridge PCI ATA Floppy Net Kbd USB Mouse USB Sound Serial Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies CPU CPU RAM Bridge Video Bridge PCI ATA SATA Floppy Net Wlan Kbd USB Mouse USB Sound SD Serial Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies Many core Many core CPU CPU Video RAM Bridge PCI−e Bridge PCI ATA SATA Floppy Net Wlan Kbd USB Mouse USB Sound SD Serial Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution History of hardware topologies Many core Many core RAM RAM CPU CPU PCI−e Video Bridge PCI−e Bridge PCI ATASATA Floppy Net Wlan Kbd USB Mouse USB Sound SD Serial Gabriel Laskar (EPITA) CAAL 2015 373 / 378 Multi-/Many-core, heterogeneous systems Topology evolution Future of hardware topologies? GPGPU GPGPU Many core Many core CPU CPU RAM RAM GPGPU GPGPU Many core Many core CPU CPU RAM RAM GPGPU GPGPU Many core Many core CPU CPU RAM RAM Bridge PCI ATASATA Floppy Video Net Wlan Kbd USB Mouse USB Sound SD Serial Gabriel Laskar (EPITA) CAAL 2015 374 / 378 Multi-/Many-core, heterogeneous systems Topology evolution Hardware topologies Observations Busses don’t scale I Bandwidth is shared among connected peers I They need rest cycles between elections We don’t have busses any more I We use networks (QPI, HyperTransport) I This forbids snooping I Coherence has to be done explicitely Gabriel Laskar (EPITA) CAAL 2015 375 / 378 Multi-/Many-core, heterogeneous systems Topology evolution NUMA Observations I There are more and more cores I Memory gets closer to the cores I Memory connections get distributed among cores I Systems become NUMA Gabriel Laskar (EPITA) CAAL 2015 376 / 378 Multi-/Many-core, heterogeneous systems Challenges 11: Multi-/Many-core, heterogeneous systems Topology evolution Challenges Gabriel Laskar (EPITA) CAAL 2015 377 / 378 Multi-/Many-core, heterogeneous systems Challenges Some scalability bottlenecks Coherent shared-memory systems I are hard to design I dont scale well I get slower with the load NUMA systems I are hard to program for I are not well supported in all OSes Uncoherent shared-memory systems are not ready for prime-time yet Gabriel Laskar (EPITA) CAAL 2015 378 / 378