4.12.2014

Computer System Structures cz:Struktury počítačových systémů

Version: 1.0

Lecturer: Richard Šusta

ČVUT-FEL in Prague, CR – subject A0B35SPS

Microprocessors

2

1 4.12.2014

Definition:

 A Microprocessor is a Single VLSI Chip that has a CPU and may also have some Other Units (for example, caches, floating point processing arithmetic unit, pipelining, and super-scaling units)

Source Microprocessor Family Motorola 68HCxxx 80x86 & i860 Sun SPARC IBM PowerPC

Source: Raj Kamal, Embedded Systems: Architecture, Programming and Design 3

Existovalo a dodnes existuje mnoho typů od různých výrobců http://research.microsoft.com/en-us/um/people/gbell/CyberMuseum_contents/Microprocessor_Evolution_Poster.jpg

2 4.12.2014

From 8 bit to 32 bit CISC processors

8 bit µ-processor 8008 8080 8085 Z80 Year 1972 1974 1976 1976 Instructions 66 111 113 158 MIPS 0.06 0.29 0.37 0.58 Minimum system (IO circuits) 20-40 7 3 3

16 and 32 bit 64bit example: Intel 8086/88 80286 80386 80486 Pentium µ-processor I5-2500K Quad core Year 1978/79 1982 1985-91 1989-94 1993-98 2011 Instructions processor 133 ~175 ~195 ~200 ~225 ~500 + numeric cooprocessor MIPS 0.33-0.75 0.9-2.6 2.5-11 16-70 100-300 ~80000 Memory Physical / Virtual 1 MB 16 MB/1 GB 4 GB / 64TB 32 GB / 256 TB

Definition: Microcontroller

 A microcontroller is a single-chip VLSI unit which, though having limited computational capabilities possesses enhanced input- output capabilities and a number of on- chip functional units

Source Microcontroller Family Motorola 68HC11xx, HC12xx, HC16xx Intel 80x86, 8051, 80251 Microchip PIC 16F84, 16F876, PIC18 ARM ARM9, ARM7

Source: Raj Kamal, Embedded Systems: Architecture, Programming and Design 6

3 4.12.2014

Microcontroller

Output Output interfaces Inputintefaces In device1 Out device1 In device2 Microcontroller (uC) Out device2

In device3

Auxiliary units

7

Example: “Original” 8051 Microcontroller

Oscillator 4096 128 Bytes Two 16 Bit and timing Program Data Timer/Event Memory Memory Counters

8051 Internal data bus CPU

64 K Bus Programmable Programmable Expansion I/O Serial Port Full Control Duplex UART Synchronous Shifter subsystem interrupts

External interrupts Control Parallel ports Serial Output Address Data Bus Serial Input I/O pins Source: Raj Kamal, Embedded Systems: Architecture, Programming and Design 8

4 4.12.2014

Hardcore/Softcore Processors

 Hard-core Processors A Fabricated Integrated Circuit that may or may not be Embedded into Additional Logic  Soft-core Processors A Processor Described in a HDL that is implemented in Reconfigurable Logic (ie, in an FPGA), e.g NIOS, MicroBlaze, OpenRISC

SPS 9

Soft-core Processors Logic Elements

Ultra SPARC S1 Core/f 60000

S1 Core 37000 OpenRISC 1200 6000

LEON2 5000

Sun Microsystems Sun

LEON3 3500

ARM Cortex-M1 2600

aeMB 2536

source LatticeMico32 1984

OpenFire 1928 Open Nios II/f 1800 MicroBlaze 1324

Nios II/s 1170

DSPuva16 510

Nios II/e 390 PacoBlaze 204

LatticeMico8 200

Xilinx PicoBlaze 192 100 1000 10000 100000

SPS 10

5 4.12.2014

Nios II: RISC soft-core processor

Copyright © 2005 Corporation

Nios II Versions

 Nios II Processor Comes In Three ISA (Instruction Set Architecture) Compatible Versions

 FAST: Optimized for Speed

 STANDARD: Balanced for Speed and Size

 ECONOMY: Optimized for Size  Software  Code is Binary Compatible  No Changes Required When CPU is Changed

Copyright © 2005 Altera Corporation 12

6 4.12.2014

Nios II: Hard Numbers

Nios II/ f Nios II /s Nios II /e II 200 DMIPS @ 90 DMIPS @ 28 DMIPS @ 175MHz 175MHz 190MHz

1180 LEs 800 LEs 400 LEs Cyclone 100 DMIPS @ 62 DMIPS @ 20 DMIPS @ 125MHz 125MHz 140MHz 1800 LEs 1200 LEs 550 LEs

DMIPS = Dhrystone benchmark MIPS (million ) Dhrystone is without float point operations (Note: Whetstone [cz brousek] has float points) Comparing with Pentium (1996) 224 DMIPS @ 200MHz cca 196 DMIPS@175MHz - cca 140 DMIPS@125 MHz

Copyright © 2005 Altera Corporation 13

Processor Core Variations Nios II /f Nios II /s Nios II /e Fast Standard Economy

Pipeline 6 Stage 5 Stage None

H/W Multiplier & Emulated 1 Cycle 3 Cycle Barrel Shifter In Software

Branch Prediction Dynamic Static None

Instruction Cache Configurable Configurable None

Data Cache Configurable None None

Logic Requirements 1800 w/o MMU 1200 600 (Typical LEs) 3200 w/ MMU

Custom Up to 256 Instructions

© 2010 Altera Corporation—Public (MMU - ) ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS & STRATIX are Reg. U.S. Pat. & Tm. Off. and Altera marks in and outside the U.S. 14

7 4.12.2014

Mutlicycle CPU

Start State 0 1 Instr. Instr. fetch/ decode/reg. adv. PC fetch/branch addr. load Reg Jump Branch 3 2

Read Compute ALU Write PC Write memory memory operation on branch jump addr. data addr. condition to PC 6 9 4 store 8 5 7 Write Write memory Write register data register

15 SPS 15

Different Execution Time of Instructions instruction data registers ALU registers memory memory

Fastest

Not Not Not Jump P Not used used used used

Not Not P Not Branch C used used used

P Not Compute C used

P Load C

Not P used Store C Not

SlowestSPS 16

8 4.12.2014

Single-cycle versus multicycle instruction execution.

Clock

Time needed

Time allotted Instr 1 Instr 2 Instr 3 Instr 4

Clock

Time Time needed saved 3 cycles 5 cycles 3 cycles 4 cycles Time allotted Instr 1 Instr 2 Instr 3 Instr 4

SPS 17

Internal Registers in CPUs

Processor Memory Input/Output PC IR

AC Address Bus MAR Data Bus ALU MDR PC Program counter Control Unit AC Accumulator Internal registers IR Instruction register MAR Memory address register MDR Memory data register

SPS 18

9 4.12.2014

Cycles of Instructions

Store Fetch IR MAR MDR Result Operands Fetch Decode Execute Instruction Opcode Operation

SPS 19

Pipelining

 Řízení procesoru s pipelining (zřetězeným) zpracováním instrukcí není jednoduché. Autoři procesoru MIPS, podobného NIOSu, uvádějí ve své knize (Paterson, D., Henessy, J.: Computer Organization and Design, Elsevier), že jimi navržená jednotka pro pipelining úspěšně otestována nejméně 100 lidmi na 18 různých pracovištích obsahovala chybu ztracenou v několika tisících řádkách HDL kódu, na kterou se přišlo až delším používání. Více bude v předmětu A0B36APO Architektura počítačů

SPS 20

10 4.12.2014

Minimum Nios II Processor

Nios II Processor Core reset Instruction General clock Program Master Purpose Port Controller Registers & Address Generation Status & Control Registers

Data Master Arithmetic Port Logic Unit

= Configurable = Fixed

© 2010 Altera Corporation—Public ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS & STRATIX are Reg. U.S. Pat. & Tm. Off. and Altera marks in and outside the U.S. 21

Outside NIOS

Image source: Tron

11 4.12.2014

Simple NIOS System on DE2

Nios Processor Address (32) Switch PIO Read

32-Bit AvalonBus Nios Write LED PIO Processor Data In (32) Data Out (32) 7-Segment

LED PIO

PIO-32

UART Timer User-Defined Interface

23

Nios II/e on Basic Computer for DE2

Copyright © 2005 Altera Corporation 24

12 4.12.2014

NIOS DE2 Basic Computer

Start-address End-address Element Size 0x00000000 0x007FFFFF SDRAM 8 MB Ram

128 MB

--- 120 MB

MB 0x08000000 0x0807FFFF ROM (SRAM) 512 kB read-only 16 MB

256 256 15,5 MB 0x09000000 0x09001FFF On-chip-RAM 8 kB 112 MB 111,92 MB 0x10000000 0x10000003 LEDR 4 bytes-write-only

0x10000010 0x10000013 LEDG 4 bytes-write-only

0x10000020 0x10000023 HEX3_HEX0 4 bytes-write-only

0x10000030 0x10000033 HEX7_HEX4 4 bytes-write-only

0x10000040 0x10000043 SW 4 bytes-read-only

4 bytes data 0x10000050 0x1000005F KEY_BASE +12 bytes control

.. other I/O 25

Memory Map of Basic Computer  SDRAM - synchronous dynamic RAM, i.e. synchronized with the system bus, on the DE2 board - 1M x 16 bits x 4 banks - addresses 0x00000000 to 0x007FFFFF.  SRAM - static RAM (SRAM) chip, on the DE2 board - 256K x 16 bits - addresses space 0x08000000 to 0x0807FFFF.  On-Chip Memory 8-Kbyte memory inside the Cyclone II FPGA chip - 2K x 32 bits - addresses 0x09000000 to 0x09001FFF.

26 Copyright © 2005 Altera Corporation

13 4.12.2014

Nios II/s: DE2 Media Computer

Copyright © 2005 Altera Corporation 27

NIOS DE2 Media Computer

Start-address End-address Element Size 0x00000000 0x007FFFFF SDRAM 8 MB synchron. DRAM

128 MB

--- 120 MB

MB 0x08000000 0x0807FFFF SRAM 512 kB VGA pixel buffer 16 MB

256 256 15,5 MB 0x09000000 0x09001FFF On-chip-RAM 8 kB VGA char. buffer 112 MB 111,92 MB 0x10000000 0x10000003 LEDR 4 bytes-write-only

0x10000010 0x10000013 LEDG 4 bytes-write-only

0x10000020 0x10000023 HEX3_HEX0 4 bytes-write-only

0x10000030 0x10000033 HEX7_HEX4 4 bytes-write-only

0x10000040 0x10000043 SW 4 bytes-read-only

4 bytes data 0x10000050 0x1000005F KEY_BASE +12 bytes control 28 .. other I/O 28

14 4.12.2014

I/O DE2 Basic & Media Computer

0x10000000 LEDR_BASE 0x10000010 LEDG_BASE, 0x10000020 HEX3_HEX0_BASE 0x10000030 HEX7_HEX4_BASE 0x10000040 SW_BASE 0x10000050 KEY_BASE ...others I/O...

Literatura on DVD: DE2_Media_Computer.pdf 29

Switch and Push Button

SW_BASE

KEY_BASE

30

15 4.12.2014

LED

LEDR_BASE

LEDG_BASE

31

HEX

HEX3_HEX0_BASE

HEX7_HEX4_BASE

32

16 4.12.2014

Wiki: Memory Mapped IO

Memory-mapped I/O (MMIO) and port I/O (also called isolated I/O or port-mapped I/O abbreviated PMIO) are two complementary methods of performing input/output between the CPU and peripheral devices in a computer. An alternative approach, not discussed here, is using dedicated I/O processors — commonly known as channels on mainframe computers — that execute their own instructions. Memory-mapped I/O (not to be confused with memory-mapped file I/O) uses the same address bus to address both memory and I/O devices ‒ the memory and registers of the I/O devices are mapped to (associated with) address values. So when an address is accessed by the CPU, it may refer to a portion of physical RAM, but it can also refer to memory of the I/O device. Thus, the CPU instructions used to access the memory can also be used for accessing devices. Each I/O device monitors the CPU's address bus and responds to any CPU access of an address assigned to that device, connecting the data bus to the desired device's hardware register. To accommodate the I/O devices, areas of the addresses used by the CPU must be reserved for I/O and must not be available for normal physical memory. The reservation might be temporary — the Commodore 64 could bank switch between its I/O devices and regular memory — or permanent. Port-mapped I/O often uses a special class of CPU instructions specifically for performing I/O. This is found on Intel , with the IN and OUT instructions. These instructions can read and write one to four bytes (outb, outw, outl) to an I/O device. I/O devices have a separate address space from general memory, either accomplished by an extra "I/O" pin on the CPU's physical interface, or an entire bus dedicated to I/O. Because the address space for I/O is isolated from that for main memory, this is sometimes referred to as isolated I/O. [ http://en.wikipedia.org/wiki/Memory-mapped_I/O ]

33

Inside NIOS

Nios II UART

CPU

Debug Cache GPIO

On-Chip Timer ROM SPI

On-Chip Fabric Switch Avalon SDRAM RAM Controller

FPGA

17 4.12.2014

NIOS Simplified System

.text

.data

35

NIOS: Common Register Usage

Reg Name Normal usage Reg Name Normal usage r0 zero 0x0000_0000 r16 r1 at Assembler Temporary r17 r2 r18 r3 r19 r4 r20 r5 r21 r6 r22 r7 r23 r8 r24 et Exception Temporary r9 r25 bt Break Temporary r10 r26 gp Global Pointer r11 r27 sp Stack Pointer r12 r28 fp Frame Pointer r14 r29 ea Exception Return Address r14 r30 ba Break Return Address r15 r31 ra Return Address 36

18 4.12.2014

NIOS Assembler

is GNU assembler http://tigcc.ticalc.org/doc/gnuasm.html http://sources.redhat.com/binutils/docs-2.12/as.info/ by Altera monitor – lieratura on DVD • Doporuceno_Altera_Monitor_Program_Tutorial.pdf • n2cpu_nii5v1.pdf – instruction set reference

37

General Processor Instruction Formats

 Three-Address Instructions  ADD o1, o2, o3 o1 ← o2 + o3  Two-Address Instructions  MOV o1, o2 o1 ← o2  One-Address Instructions  NEXTPC o1 o1 ← PC + 4  Zero-Address Instructions  NOP r1 ← r1 add r0 (NIOS)

 R-Type - operands oi are registers only  I-Type - instruction contains an immediate value  C-Type - control instruction

38

19 4.12.2014

NIOS Addressing Modes

(a) Register direct addressing Op R1 mov R2,R1 Register contains the operand R2  R1 R2 Operand

(b) Immediate addressing Op -20 movi r2, -20 Instruction contains the operand stw R2, byte_offset(R1)

(c) Displacement (or offset) stw R2, 100(R1) : R2 → M[R1+100] addressing Memory M Address of operand = Op R2 100 register + constant

Op+ Operand

R1 Address

39

RISC vs CISC - Pentium Addressing Modes Only for self-study

LA=linear address ~ EA [Source: Intel co.] 40

20 4.12.2014

Load/Store ldw (load 32-bit word from memory) ldw rB, im-const(rA) : rB ← MEM[rA + im-const] stw (store 32-bit word into memory)

stw rB, im-const (rA) : rB → MEM[rA + im-const]

41

Example - demo1.s /*.include "nios_macros.s"*/ .equ LEDR_BASE, 0x10000000 .equ SW_BASE, 0x10000040

.text/* executable code follows */ .global_start _start: /* initialize base addresses of parallel ports */ movia r10, SW_BASE/* SW slider switch base address */ movia r12, LEDR_BASE/* red LED base address */

LOOP: ldwio r8, 0(r10) Pseudo-instruction stwio r8, 0(r12) movia r12, 0x10000040 br LOOP => orhi r12, r0, 0x1000 .end addi r12, r0, 0x0040

42

21 4.12.2014

Some GNU assembler directives .include insert another file .text this directive tells the assembler to place the following statements at the end of the code section .global this directive makes the symbol visible to the program loading instructions into memory. .data this directive tells the assembler to place the following statements at the end of the data section. .equ this directive set the value of a symbol. .word this directive is used to set the content of memory locations to the specified value. .end Marks the end of current assembly program.

43

Data Types in Instructions NIOS II registers hold 32-bit (4-byte) words.

Other common data sizes include byte and halfword. Byte = 8 bits - example: ldb / ldbu + ldbio / ldbuio - load byte extend signed/unsigned

Halfword = 2 bytes example: - ldh / ldhu + ldhio / ldhuio - load halfword extend signed/unsigned

Word = 4 bytes - example: - ldw + ldwio - load word

Doubleword = 8 bytes - user defined instructions only

u-unsigned extension io - bypass cashe 44

22 4.12.2014

Unsigned/Signed Extension

X • • • 0

• • •

Xu' • • • • • • k m unsigned m X • • •

• • •

X  • • • • • • k m signed 45

ALU Instructions operation R-format I-format add add addi subtract sub subi Lower 32bits multiply mul muli of 64bit result High 32bits mulxuu,mulxsu of 64bits mulxss High 32bits of 64bit result AND and andi andhi OR or ori orhi XOR xor xori xorhi NOR nor

pseudo-instruction addi rB ← rA + se(number) Logic instructions AND, OR, XOR, NOR do not use se = sign-extension 46

23 4.12.2014

Increasing Performance

Image source: Tron

3 Ways to Increase Performance

Custom Hardware Multi-Processor Instructions Accelerators System

FPGA FPGA FPGA Nios II

Nios II Nios II External Nios II Custom Hardware CPU or

Instructions Accelerator DSP PCIe Nios II

 Accelerate CPU  Add external  Add more processing co-processing processors performance with hardware to (internal &/or application- accelerate data external) to increase specific hardware functions processing power

© 2010 Altera Corporation—Public ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS & STRATIX are Reg. U.S. Pat. & Tm. Off. 48 and Altera marks in and outside the U.S.

24 4.12.2014

Accelerating Software in FPGA

Custom Logic

2,500 A 2,000 +- << C - Out 1,500 530 Times >> 27 Times Faster & 1,000 B Faster 500 Nios II Embedded Processor

Iterations/Second 0 Software Custom Accelerator Control

Only Instruction Nios II

Custom Accelerator DMA • Accelerator running 64Kb CRC at 100 MHz Instruction DMA (CRC can be easily calculated on a Linear Feedback Shift Register, see the next lecture)

Arbiter Arbiter

Program Data Data Memory Memory Memory

© 2010 Altera Corporation—Public ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS & STRATIX are Reg. U.S. Pat. & Tm. Off. and Altera marks in and outside the U.S. 49

Altera C2H – the SW Accelerator Solution

 Generates and integrates a custom hardware accelerator from an ANSI C function.

C2H CPU Accelerator

Arbiter Arbiter

Program Data Data Memory Memory Memory

© 2010 Altera Corporation—Public ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS & STRATIX are Reg. U.S. Pat. & Tm. Off. and Altera marks in and outside the U.S. 50

25 4.12.2014

Expander created from modified VGAFrame

OUTPUT LEDR[17] show bits of datain regions VCC OUTPUT LEDR[15..12]

OUTPUT LEDR[9..0] VCC

OUTPUT LEDR[16] altpll0 OUTPUT LEDR[11..10] GND

Freq25MHz175 GND INPUT inclk0 c0 CLOCK_27 VCC inclk0 frequency: 27.000 MHz Operation Mode: Normal locked

Clk Ratio Ph (dg) DC (%) c0 1007/1080 0.00 50.00 OUTPUT TD_RESET inst Cyclone II VCC

VGAsynchro VGANiosRectangle

CLK_25MHz175 yrow[9..0] yrow[9..0] VGA_R[9..0] OUTPUT VGA_R[9..0] ACLRN xcolumn[9..0] xcolumn[9..0] VGA_G[9..0] OUTPUT VGA_G[9..0] VGA_CLK VGA_CLK_in VGA_B[9..0] OUTPUT VGA_B[9..0] VGA_BLANK VGA_BLANK_in VGA_CLK OUTPUT VGA_CLK VGA_HS VGA_HS_in VGA_BLANK OUTPUT VGA_BLANK VGA_VS VGA_VS_in VGA_HS OUTPUT VGA_HS VGA_SYNC VGA_SYNC_in VGA_VS OUTPUT VGA_VS ACLRN_in VGA_SYNC OUTPUT VGA_SYNC inst4 datain[17..0] tb_yrow[9..0] tb_xcolumn[9..0] INPUT SW[17..0] VCC inst1 datain(17) - enable write; datain(15 downto 12) - memory index 51 (0-x0, 1-y0, 2-xwidth, 3-yheight; 4 to 6 - RGB color of background, 8 to 10 - RGB color of the rectangle)

VGANiosRectangle

VGA_R[9..0]

yrow[9..0] frame:process(VGA_CLK) yheight x0,y0 VGA_G[9..0] xcolumn[9..0]

xwidth VGA_B[9..0] VGA_CLK paramaters of picture 0 1 2 3 4 5 6 8 9 10

ACLRN x0 y0 xw yh R G B Rb Gb Bb subtype datain_t is unsigned(9 downto 0); datain[17..0] type memory_t is array(0 to 15) of datain_t; signal memory : memory_t;; store:process(VGA_CLK, ACLRN)

52

26 4.12.2014

Top entity of Basic Computer

53

DE2 Basic Computer Schema Detail

DE2_Basic_Computer

CLOCK_50 INPUT BIDIR GPIO_0[35..0] VCC CLOCK_50 GPIO_0[35..0] VCC CLOCK_27 INPUT BIDIR GPIO_1[35..0] VCC CLOCK_27 GPIO_1[35..0] VCC KEY[3..0] INPUT BIDIR SRAM_DQ[15..0] VCC KEY[3..0] SRAM_DQ[15..0] VCC SW[17..0] INPUT BIDIR DRAM_DQ[15..0] VCC SW[17..0] DRAM_DQ[15..0] VCC UART_RXD INPUT OUTPUT LEDG[8..0] VCC UART_RXD LEDG[8..0] LEDR[17..0] OUTPUT LEDR[17..0] HEX0[6..0] OUTPUT HEX0[6..0] HEX1[6..0] OUTPUT HEX1[6..0] HEX2[6..0] OUTPUT HEX2[6..0] HEX3[6..0] OUTPUT HEX3[6..0] HEX4[6..0] OUTPUT HEX4[6..0] HEX5[6..0] OUTPUT HEX5[6..0] HEX6[6..0] OUTPUT HEX6[6..0] HEX7[6..0] OUTPUT HEX7[6..0] SRAM_ADDR[17..0] OUTPUT SRAM_ADDR[17..0] SRAM_CE_N OUTPUT SRAM_CE_N SRAM_WE_N OUTPUT SRAM_WE_N SRAM_OE_N OUTPUT SRAM_OE_N 54 SRAM_UB_N OUTPUT SRAM_UB_N SRAM_LB_N OUTPUT SRAM_LB_N UART_TXD OUTPUT UART_TXD DRAM_ADDR[11..0] OUTPUT DRAM_ADDR[11..0] DRAM_BA_1 OUTPUT DRAM_BA_1 DRAM_BA_0 OUTPUT DRAM_BA_0 DRAM_CAS_N OUTPUT DRAM_CAS_N DRAM_RAS_N OUTPUT DRAM_RAS_N 27 DRAM_CLK OUTPUT DRAM_CLK DRAM_CKE OUTPUT DRAM_CKE DRAM_CS_N OUTPUT DRAM_CS_N DRAM_WE_N OUTPUT DRAM_WE_N DRAM_UDQM OUTPUT DRAM_UDQM DRAM_LDQM OUTPUT DRAM_LDQM

inst 4.12.2014

Connected Accelerator

altpll0

Freq25MHz175 inclk0 c0 inclk0 frequency: 27.000 MHz DE2_Basic_Computer Operation Mode: Normal locked

CLOCK_50 INPUT BIDIR GPIO_0[35..0] VCC CLOCK_50 GPIO_0[35..0] VCC Clk Ratio Ph (dg) DC (%) CLOCK_27 INPUT BIDIR GPIO_1[35..0] VCC CLOCK_27 GPIO_1[35..0] VCC c0 1007/1080 0.00 50.00 KEY[3..0] INPUT KEY[3..0] SRAM_DQ[15..0] BIDIR SRAM_DQ[15..0] VCC VCC OUTPUT TD_RESET INPUT BIDIR

SW[17..0] DRAM_DQ[15..0] VCC VCC SW[17..0] DRAM_DQ[15..0] VCC inst2 Cyclone II UART_RXD INPUT OUTPUT LEDG[8..0] VCC UART_RXD LEDG[8..0] LEDR[17..0] HEX0[6..0] OUTPUT HEX0[6..0] HEX1[6..0] OUTPUT HEX1[6..0] VGAsynchro VGANiosRectangle HEX2[6..0] OUTPUT HEX2[6..0] CLK_25MHz175 yrow[9..0] yrow[9..0] VGA_R[9..0] OUTPUT VGA_R[9..0] HEX3[6..0] OUTPUT HEX3[6..0] ACLRN xcolumn[9..0] xcolumn[9..0] VGA_G[9..0] OUTPUT VGA_G[9..0] HEX4[6..0] OUTPUT HEX4[6..0] VGA_CLK VGA_CLK_in VGA_B[9..0] OUTPUT VGA_B[9..0] HEX5[6..0] OUTPUT HEX5[6..0] VGA_BLANK VGA_BLANK_in VGA_CLK OUTPUT VGA_CLK HEX6[6..0] OUTPUT HEX6[6..0] VGA_HS VGA_HS_in VGA_BLANK OUTPUT VGA_BLANK HEX7[6..0] OUTPUT HEX7[6..0] VGA_VS VGA_VS_in VGA_HS OUTPUT VGA_HS SRAM_ADDR[17..0]OUTPUT SRAM_ADDR[17..0] VGA_SYNC VGA_SYNC_in VGA_VS OUTPUT VGA_VS SRAM_CE_N OUTPUT SRAM_CE_N ACLRN_in VGA_SYNC OUTPUT VGA_SYNC SRAM_WE_N OUTPUT SRAM_WE_N datain[17..0] tb_yrow[9..0] SRAM_OE_N OUTPUT SRAM_OE_N inst4 tb_xcolumn[9..0] SRAM_UB_N OUTPUT SRAM_UB_N SRAM_LB_N OUTPUT SRAM_LB_N inst1 UART_TXD OUTPUT UART_TXD DRAM_ADDR[11..0]OUTPUT DRAM_ADDR[11..0] DRAM_BA_1 OUTPUT DRAM_BA_1 OUTPUT LEDR[17..0] DRAM_BA_0 OUTPUT DRAM_BA_0 DRAM_CAS_N OUTPUT DRAM_CAS_N DRAM_RAS_N OUTPUT DRAM_RAS_N DRAM_CLK OUTPUT DRAM_CLK DRAM_CKE OUTPUT DRAM_CKE DRAM_CS_N OUTPUT DRAM_CS_N DRAM_WE_N OUTPUT DRAM_WE_N DRAM_UDQM OUTPUT DRAM_UDQM DRAM_LDQM OUTPUT DRAM_LDQM

inst

55

E(x)ternal Influence Q: – How to transfer signals depended on different clocks? A: If you look at the picture you will see the key to this solution: You just need second hands for automatic shaking:-)

[Image: 9gag.com] 56

28 4.12.2014

Synchronize Signal with 2nd Clock

Clk1 Clk2 2 unsynchronized clocks

Signal- data Hand clk1 RDY Snake source new data are ready

communication channel

Send without implicit hand-shaking (cz: nepotvrzovaný přenos dat) It is similar to UDP on networks (UDP - User Datagram Protocol) 57

Handshaking

data Data

Signal- Handshake RDY register

clk1 FSM

source SLOAD

clk=5kHz Clk1 ACK Clk2 a signal synchronized - acknowledge by the 1st clock read by hand-shaking (cz: kladně potvrzovaný přenos) like TCP = Transmission Control Protocol

58

29 4.12.2014

Handshake with stable data data Signal- request Register clk1 Hand RDY SLOAD source Clk1 shake ACK clk2

clk1 data 3 1 1 1 1 request 2

RDY 7 4 4 5 5 5 ACK 6 59 RDY is synchronized with clk1, but not with clk2

FSM for Handshaking Clock1 Moore ACK RDY Clk1-signal='0' Machine

S0 Clk1-signal ='0' and ACK='0'

Clk1-signal ='1' ACK='0'

ACK='1' S S1 2 RDY

Clk1-signal ='1' or ACK='1'

30 4.12.2014

Handshaking for unstable data

data init1 Handshake Signal- FSM Shift clk1 RDY register source

SLOAD

clk1 ACK - acknowledge clk2

61

Handshake with unstable data init0 init1 Signal- Hand Shift RDY clk1 KEY request shake SLOAD source clk1 ACK clk2

clk1 init0 1 1 1 1 request 2 init1 3 RDY 7 4 4 5 5 5 ACK 6 62 RDY is synchronized with clk1, but not with clk2

31 4.12.2014

Handshake for unstable data (1/4)

Input ports ENTITY handshake is Output of ENTITY ARCHITECTURE rtl of handshake is ports of ENTITY

clock data1 <=dataReg;

aclrn data1 data0 signal loadData signal

dataReg

process signal ack stNext StateRegister

request rdy

process NextState

signal stReg

SPS 63

Handshake for unstable data (2/4)

LIBRARY ieee; USE ieee.std_logic_1164.all; L:Libraries ENTITY handshake is PORT( ack, request : IN STD_LOGIC; data0 : in std_logic_vector(15 downto 0); clock, aclrn : IN STD_LOGIC; EP: entity-ports rdy : OUT STD_LOGIC; data1 : out std_logic_vector(15 downto 0) ); end handshake; ARCHITECTURE rtl of handshake is type state is (s0, s1, s2); AD: architecture signal stReg, stNext : state; declarations signal dataReg:std_logic_vector(15 downto 0); signal loadData:std_logic; begin ACS: architecture concurrent StateRegister: PROCESS(clock, aclrn) statements There are 3 BEGIN...END PROCESS; concurrent statements: NextState: PROCESS(stReg, ack, request) StateRegister:process, BEGIN...END PROCESS; NextState:process, and data1<=dataReg; <= end rtlSPS; 64

32 4.12.2014

Handshake for unstable data (3/4) StateRegister: PROCESS(clock, aclrn) PH : process - header with sensitivity list begin if (aclrn = '0') then stReg <= s0; dataReg<=(others=>'0'); elsif rising_edge(clock) PC: process-clock test then stReg <= stNext; if loadData='1' PS: process then dataReg<=data0; -synchronous part end if; end if; end PROCESS;

SPS 65

Handshake for unstable data (4/4)

NextState: PROCESS(stReg, ack, request) PH: process-header with begin sensitivity list rdy<='0'; loadData <= '0' ; case stReg is when s0 => if request = '1' then stNext <= s1;loadData <= '1' ; else stNext <= s0; end if; when s1 => rdy<='1'; PD:= process- if ack = '1' then stNext <= s2; dataflow else stNext <= s1; end if; part when s2 => if request = '0' and ack = '0' then stNext <= s0; else stNext <= s2; end if; when others => report "Undefined state"; end case; end PROCESS;

SPS 66

33 4.12.2014

RTL

dataReg[15..0] PRE data0[15..0] D Q data1[15..0] stReg 0 0 ENA ack ack NextState 1 CLR clk s0 NextState:OUT0 loadData request s1 rdy request aclrn reset

clock

s0 s1 s2 stReg~1

SPS 67

Příjem signálu handshake SLOAD DFFE-registry D Q DATA Data- EN synchronní CLK1 CLK z clk2 ACK CLRN DFF DFF RDY D Q D Q

CLK CLK CLRN CLRN Asynchronous clear active-low

Clk2 Obvod Synchronizátor není nutný, potvrzení ale velmi doporučený

SPS 68

34 4.12.2014

NIOS gcc

programming in low-level C language

69

Basic Steps of C Compiler

stdio.h string.h code.C code.h

C preprocessor modifies source code by substitutions

code.tmp optional NIOS gcc C-compiler passes 1 to M optional parts parts metacode

Compiler passes M to N: Optimization of metacode

system relative object libraries module NIOS Linker Loader processor

70 70

35 4.12.2014

C primitive types Size Java C C alternative Range 1 boolean any integer, true if !=0 BOOL(1 0 to !=0 8 byte char(2 signed char –128 to +127 8 unsigned char BYTE(1 0 to 255 16 short int signed short –32768 to +32767 16 unsigned short 0 to + 65535 32 int int signed int -2^31 to 2^31-1 32 unsigned int DWORD(1 0 to 2^32-1 64 long long long int -2^63 to 2^63-1 64 unsigned long LWORD(1 0 to 2^64-1 1) In many implementations, it is not a standard C datatype, but only common custom for user's "#define" macro definitions, see next slides

2) Default is signed but it is best to specify. 71

Sign Extension Example in C

short int x = 15213; int ix = (int) x; short int y = -15213; int iy = (int) y;

Decimal Hex Binary x 15213 3B 6D 00111011 01101101 ix 15213 00 00 C4 92 00000000 00000000 00111011 01101101 y -15213 C4 93 11000100 10010011 iy -15213 FF FF C4 93 11111111 11111111 11000100 10010011

72

36 4.12.2014

NIOS C #define / const

 const int TEN=10; read only access const ~ static final in Java

const int TEN2 = TEN; /* Error: initializer element is not constant Note: OK in C++, but C variables must not initialize constants * /

 #define TENDEF 10 //substitution rule

 const int TEN3 = TENDEF; // OK /* after preprocessor run: const int TEN3 = 10; */

73

Definition of BYTE and BOOL

// by substitution rule no ; and no type check  #define BYTE unsigned char  #define BOOL int

// by introducing new type, ending ; is required  typedef unsigned char BYTE;  typedef int BOOL;

C language has no strict type checking #define ~ typedef, but typedef is usually better integrated into compiler. 74

37 4.12.2014

Bitwise Logic Instructions Examples:  AND & n = n & 0xF0;  OR | n = n | 0xFF00;  XOR ^ n = n ^ 0x80;  left shift << n = 0xFF << 4;  right shift >> n = n >> 4  NOT ~ n = ~0xFF;

In C, operator << usually behaves as logical for unsigned operand and as arithmetic shift for signed operands. 75

NIOS Shift Instructions

shift dir Register Immediate note logical left sll slli C << ; *2 logical right srl srli C >> ; unsigned/2 arithmetic right sra srai signed / 2 rotate left rol roli barrel shifter rotate right ror --- barrel shifter

sll rC, rA, rB rC ← rA << (rB[4..0])

Immediate slli rC, rA, sh rC ← rA << (sh)

76

38 4.12.2014

NIOS and C

1. ORI r4, r2, 0x30 2. ANDI r4, r2, 0x0F 3. XORI r4, r2, 0b10000000 4. NOR r4, r2, r0 /* r4←not r2 */ 5. SLLI r4, r2, 2 6. SRLI r4, r2, 3 7. SRAI r4, r2, 4 8. Rotations ROR and ROL has no direct C equivalents.

int x4, x2=-10; 1. x4 = x2 | 0x30 ; 2. x4 = x2 & 0x0F ; 3. x4 = x2 ^ 0b10000000; 4. x4 = ~x2 5. x4 = x2<<2; 6. x4 = (unsigned int) x2>>3 7. x4 = (signed int) x2>>3; 77

Example - demo2.s

/*.include "nios_macros.s"*/ .equ LEDR_BASE, 0x10000000 .equ SW_BASE, 0x10000040

.text/* executable code follows */ .global_start _start: /* initialize base addresses of parallel ports */ movia r10, SW_BASE/* SW slider switch base address */ movia r12, LEDR_BASE/* red LED base address */

LOOP: ldwio r8, 0(r10) slli r8, r8, 1 stwio r8, 0(r12) br LOOP

.end

78

39 4.12.2014

volatile keyword int main(void) optimized { int x4, x2= -10; //0xfffffff8 code x4 = x2<<2; x4 = (unsigned int) x2>>3; // x4 = 0x1ffffffe x4 = (signed int) x2>>3; // x4=0xfffffffe }

int main(void) { // no assumption about value volatile volatile int x4, x2= -10; winged animal from plural x4 = x2<<2; of Latin volatilis x4 = (unsigned int) x2>>3; = winged x4 = (signed int) x2>>3;

} 79 79

Volatile

 When the keyword volatile is specified for a variable, it gives a hint to the compiler not to do certain optimizations as this variable can change from other parts of the program unexpectedly.  What is meant here, is that the compiler should not reuse the value already loaded in a register, but access the memory again as the value in register is not guaranteed to be the same as the value stored in memory.  All addresses of IO devices should be volatile. [Source: C-reference]

80

40 4.12.2014

Pointer Operators  & (address operator) Returns the address of its operand Example int y = 5; int *yPtr; yPtr = &y; // yPtr gets address of y yPtr “points to” y yptr 500000 600000 y 5 yPtr address of y is value of yptr y 600000 5

81

Pointer Operators

 & (address operator) Returns the address of its operand  * dereference address Get operand stored in address location  * and & are inverses (though not always applicable) Cancel each other out *&myVar == myVar and &*yPtr == yPtr

82

41 4.12.2014

Size of Pointer in C-kod int * ptri; ptri - char * ptrc; double * ptrd; ptri+1

*ptrx ≡ ptrx[0] ptrc

*(ptrx+1) ≡ ptrx[1] ptrc+1 *(ptrx+n) ≡ ptrx[n] ptrd *(ptrx-n) ≡ ptrx[-n]

nr1 = sizeof (double); nr2 = sizeof (double*); ptrd+1 + nr1 != nr2 83

Edge Detector – Full Source Code .equ IOBASE, 0x10000000 .equ LEDG, 0x10 Memory Address .equ SW, 0x40 .text = C pointer

.global _start type * pointer; _start: movia r8, IOBASE mov r18, zero LOOP: Dereferencing ldwio r15, SW(r8) Pointer nor r9, zero, r17 /* r9:=not r17*/ and r9, r9, r16 * (pointer+ix) and r9, r9, r15 cmpne r9, r9, zero ≡ pointer[ix] xor r18, r18, r9 stwio r18, LEDR(r8) mov r17, r16 mov r16, r15 br LOOP .end 84

42 4.12.2014

Char [ ]

C-kod: text: ’H’ 0 char text[] = ”Hello World!”; ’a’ 1 ’l’ 2 ’l’ 3 /* text = &text[0] */ text[4] ’o’ 4 ’ ’ 5 ’W’ 6 text[7] ’o’ 7 Assembler: ’r’ 8 ’l’ 9 .data ’d’ 10 ’!’ 11 text: .asciz ”Hello World!” 0x00 12

85

int [ ]

C-kod: int vect[] = {1, 10, 0xA};

/* vect = &(vect[0]) */ vect: 0x01 0 0x00 1 0x00 2 0x00 3 Assembler: *(vect+1) ≡ vect[1] 0x0A 4 0x00 5 .data 0x00 6 0x00 7 .align 2 *(vect+2) ≡ vect[2] 0x0A 8 0x00 9 vect: .word 1, 10, 0xA 0x00 10 #little endian byte ordering 0x00

86

43 4.12.2014

Pointer and const int x;  int * lpio = 0x10000000; *lpio = 1; x=*lpio; lpio++;  const int * lpCio = 0x10000000; *lpCio = 1; x=*lpCio; lpCio++;  int * const lpioC = 0x10000000; *lpioC = 1; x=*lpioC; lpioC++;  const int * const lpCioC = 0x10000000; *lpCioC = 1; x=*lpCioC; lpCioC++;

87

VGARectangle

DE2 Computer Extender

44 4.12.2014

C-program //#define DEBUG volatile int * red_LED_ptr = (int *) 0x10000000; volatile int * SW_switch_ptr = (int *) 0x10000040; void WrData(int address, int data); // WrData to extender int Count(int * counter, int * updown, int max); // increment/decrement counter on updown bits int main(void) { int x0=0, y0=0, xwidth=1, yheight=1;WrData(0, x0); WrData(1, y0); WrData(2, xwidth); WrData(3, yheight); // initilize extender while(1) { int SW_value = *(SW_switch_ptr); // read the SW slider switch values if(Count(&x0, &SW_value, 640)) WrData(0, x0); if(Count(&y0, &SW_value, 480)) WrData(1, y0); if(Count(&xwidth, &SW_value, 640)) WrData(2, xwidth); if(Count(&yheight, &SW_value, 480)) WrData(3, yheight); int delay_count = 500000; while(delay_count) --delay_count; // delay loop } } 89

Functions // Write on Address of extender data void WrData(int address, int data) { int send = (data & 0x3FF) | ((address&3)<<14); // address 15. and 14. bit + data *red_LED_ptr=send; *red_LED_ptr=(send | 0x20000); // toggle 17. bit int delay=10; while(delay>0) delay--; *red_LED_ptr=send; // reset toggle #ifdef DEBUG printf("%d:%d\t",address, data); #endif } // increment/decrement counter on updown bits int Count(int * counter, int * updown, int max) { int ud=*updown, cntr=*counter, cntr0=cntr; if((ud & 1) != 0 && cntr>0) cntr--; if((ud & 2) != 0 && cntr>2; *counter=cntr; return cntr!=cntr0; // move to next value } 90

45 4.12.2014

Edge Detection in C

Optimal / Non-optimal code

This parts demonstrates links between assembler and C program

46 4.12.2014

NIOS: Common Register Usage in C

Reg Name Normal usage Reg Name Normal usage r0 zero 0x0000_0000 r16 Callee-Saved r1 at Assembler Temporary r17 General-Purpose Registers r2 Return Value (least-significant 32 bits) r18 r3 Return Value (most significant 32 bits) r19 r4 Register Arguments (First 32 bits) r20 r5 Register Arguments (Second 32 bits) r21 r6 Register Arguments (Third 32 bits) r22 r7 Register Arguments (Fourth 32 bits) r23 r8 Caller-Saved r24 et Exception Temporary r9 General-Purpose Registers r25 bt Break Temporary r10 r26 gp Global Pointer r11 r27 sp Stack Pointer r12 r28 fp Frame Pointer r14 r29 ea Exception Return Address r14 r30 ba Break Return Address

r15 r31 ra Return Address 93

VHDL 32 bit edge detector library ieee; use ieee.std_logic_1164.all; entity re32d2 is port(data : in std_logic_vector(31 downto 0); clk : in std_logic; q : out std_logic); end entity; architecture rtl of re32d2 is signal r15, r16, r17: std_logic_vector(31 downto 0); --the shift register signal r18: std_logic:='0'; --output flip-flop begin process (clk) begin if (rising_edge(clk)) then if((r15 and r16 and (not r17))/=X"00000000") then r18<=not r18; end if; r17<=r16; r16<=r15; r15<=data; --shift end if; end process; q<=r18; --write output end rtl; 94

47 4.12.2014

NIOS SW Edge Detector r9 lpm_or32to1 r9

q[31..0] 32

32 0 inst7 data[17..0] data[31..18]

INPUT sw[17..0] VCC

14 inst1

5

GND inst

32

32 32 data[31..0] lpm_and inst10 2 XOR 4 NOT lpm_tff0 6 DFF lpm_tff0 3 lpm_tff0 DFF data[31..0] DFF VCC q[31..0] data[31..0] DFF PRN clock q[31..0] data[31..0] D Q

clock q[31..0]

clock

1 aclr

r15 aclr inst5 9 CLRN r16 8 aclr inst4 r17 inst6

inst3 VCC

GND r18 FreqDivByEven F=10kHz OUTPUT LEDG[0] CLOCK_50 INPUT 7 VCC CLK q Parameter Value Type inst15 Even_Divisor 5000 Unsigned Integer 5. cmpne r9, r9, zero 1. ldwio r15,SW(r8) 6. xor r18, r18, r9 2. nor r9, zero, r17 7. stwio r18, LEDR(r8) 3. and r9, r9, r16 8. mov r17, r16 4. and r9, r9, r15 9. mov r16, r15 95

Low-level C-code ~ Assembler

.equ IOBASE, 0x10000000 const int IXLEDG=0x10/sizeof(int); .equ LEDG, 0x10 const int IXSW =0x40/sizeof(int); It corresponds .equ SW, 0x40 int main(void) to our .text { assembler .global _start int reg17, reg16, reg15; code _start: volatile int * const IOBASE= 0x10000000; movia r8, IOBASE int reg18=0; mov r18, zero

LOOP: while(1) ldwio r15, SW(r8) { reg15 = IOBASE[IXSW]; nor r9, zero, r17 /* r9:=notr17*/ if(reg15 & reg16 & ~reg17) and r9, r9, r16

and r9, r9, r15 reg18= reg18^1; cmpne r9, r9, zero

xor r18, r18, r9 // store result stwio r18, LEDR(r8) IOBASE[IXLEDG] = reg18; mov r17, r16 reg17=reg16; // shift mov r16, r15 reg16=reg15; br LOOP } .end } 96 96

48 4.12.2014

Higher level C-code int main(void) { volatile int * lpLEDG = (int *) 0x10000010; // red LED address volatile int * lpSW= (int *) 0x10000040; // SW switch address

int SW_mem[3], SW_value, ledVal=0;

while(1) { SW_value = lpSW[0]; // *lpSW read the SW slider switch values if(SW_mem[0] & SW_mem[1] & ~SW_mem[2]) ledVal= ledVal^1; // light up the green LEDs lpLEDG[0] = ledVal; int i; for(i=2; i>0; i--) SW_mem[i]=SW_mem[i-1]; SW_mem[0]=SW_value; // *SW_mem=SW_value } } High level C-code is compiled into ~28 NIOS instructions, the low-level into 15 NIOS instructions 97

49