Principles of Software Emulation

Joe Bertolami
www.bertolami.com

Part I: Emulation Basics

Part II: Emulator Architecture

Part III:

Part I

Emulation Basics

What is an emulator?

A software application written for a host system that mimics the behavior of a target system. This enables software originally written for the target system to be executed on the host.

Q: Why bother to write an emulator?

1 Preservation: Protect our ability to use legacy platforms long after their extinction.
2 Education: Learn about platform development through a fun technical challenge.
3 Software Piracy: Play your favorite games without having to buy them or the console. PIRACY HURTS KITTENS!
4 Commercial: Meet a business need, such as supporting backwards compatibility or pre-hardware development.

non-gaming use cases

● Development
● Testing
● Legacy hardware

Emulation != simulation

Emulation: Precise reconstruction of target hardware and software. Provides ability to directly execute programs compiled for the target platform.

Simulation: Approximate reconstruction of target software (only). Generally produces similar behavior, but is compiled for the host platform.

Emulation Pattern

● Target software: Compiled for the target platform
● Emulator: Compiled for the host platform, but emulates the target hardware

Simulation Pattern

● Compiled for the host platform

Emulation vs. Simulation

Application source code
    Compiled for target platform: runs on target platform, or runs under emulation
    Compiled for host platform: runs under simulation, or runs on host platform

[Figure: Emulators, Simulators]

Emulation vs. Simulation

A simulator…

● is typically more efficient
● may be less complex

An emulator…

● is typically more precise
● may be the only practical way

Q: Why is emulation so difficult?

Emulation of 1 target instruction usually requires >1 host instruction

Translating instructions or data isn’t always straightforward

Target software may exercise obscure, faulty, and undocumented parts of the hardware

Some platforms don’t want to be emulated

Emulation in Practice

Most emulators are approximations

Precise emulation (e.g. cycle-level timing accuracy) requires significantly more horsepower

(See: Why Perfect Hardware SNES Emulation Requires a 3 GHz CPU)

CPU Transistor Count

    Target                 Host / Target transistor multiplier
    Atari 2600             375x
    Nintendo NES           68x
    Nintendo Super NES     7.2x
    Sony Playstation       11.2x
    Sega Dreamcast         8x
    Microsoft Xbox         7.85x

ϕ First emulated: determined by date of first stable release that reasonably mimics original functionality.

The emulation scene took off in the 1990s as consumer grade hardware became capable of emulating early game systems, and the Internet provided:

● Access to a community of developers and enthusiasts
● Access to platform information and tools for reverse engineering
● Distribution channel for emulators and applications

Rule of thumb: Ignoring outliers, it generally takes about 8-10x the horsepower in the host to emulate a target system.

The original Xbox emulator was built by Microsoft with the aid of full documentation. Even with this advantage, the emulator was never able to support the entire Xbox game catalog.

The original Xbox (target) and Xbox 360 (host) were significantly different platforms in almost every way, and asymmetries in their CPU architectures made this an extremely challenging project.

Emulation in Practice

greatly slow emulator development

Example

Part II

Emulator Architecture

Building an Emulator

Research: Reverse engineer or obtain detailed specs that describe every part of the system. This likely includes: CPU, GPU, APU, DSPs, input, memory, storage, media, and network.

Build: Write the logic for each component, including their interconnections, boot processes, and interrupt handlers.

Test: Experiment with target software to find bugs and performance traps. ↻ Make adjustments as needed.

low level emulation, high level emulation

Emulation Levels

Low level emulation: Imitate a low level hardware interface implementing virtual

High level emulation: Intercept application calls to target hardware and route them to high level host APIs

Advantages:

Emulation Levels

Low level emulation

Pseudo-snippet: Clear the screen using target platform API

    def render_frame():
        clear_screen(BLACK_COLOR)
        … some logic that renders the frame …
    ...
    def clear_screen(color):
        … some logic that controls hardware …

Target instructions for both routines:

    def render_frame()
        ASL $43, X
        ROL $F8, X
        AND ($43, X)
        JSR
        ...
    def clear_screen()
        AND $11F8, Y
        EOR ($F8, X)
        SED
        ADC ($03, X)

Emulation Levels

High level emulation

Pseudo-snippet: Clear the screen using target platform API

    def render_frame():
        clear_screen(BLACK_COLOR)
        … some logic that renders the frame …
    ...
    def clear_screen(color):
        … some logic that controls hardware …

Target instructions, with the call to clear_screen intercepted:

    def render_frame()
        ASL $43, X
        ROL $F8, X
        AND ($43, X)
        JSR    (Intercept and route to a native host routine)
        ...

Emulation Architecture

    Target component       Host implementation
    CPU                    Virtual CPU
    GPU                    Virtual display unit
    APU                    Virtual audio unit
    Memory                 Allocated memory buffers
    Physical storage       Data files
    Controllers            USB device managers
    Network controllers    Socket manager

The host side is reached through Host Input APIs and Host Output APIs.

SO, LET’S BUILD AN EMULATOR

Nintendo Family Computer

● Released 7/15/1983 in Japan as the Famicom, and 10/18/1985 in North America as the NES

● Sold 62M units, $7B total revenue by 1992

● A Nintendo product, but also a partnership

Nintendo Famicom: System Overview

[Figure: system board. Labeled parts: Controller Ports, Expansion Slot, PPU, Power Switch, CPU + APU, Lockout Chip, 2x2KB RAM, Cartridge Slot, Display Output]

Minimum viable emulator (interactive images on screen): Controller Ports, PPU, CPU, 2x2KB RAM

Nintendo Famicom: Copy Protection

lockout chip

System Components We’ll Emulate

CPU: Ricoh 6502 (inside a 2A03 package)

PPU: Ricoh 2C02

Input: 2 Controller Ports

Nintendo Famicom: Controller (NES)

Nintendo Famicom: Game Cartridge

● Memory cartridges

● remarkably flaky

Did you have a special workaround to fix a bad connection? Nope.

The only thing that helped was removing the cartridge and reinserting (source).

Nintendo Famicom: Cartridge Overview

[Figure: cartridge board. Labeled parts: Battery, Mapper / MMC, WRAM, Lockout Key (CIC), CHR ROM, PRG ROM]

Minimum viable emulator (images on screen): CHR ROM, PRG ROM

Cartridge Components We’ll Emulate

Cartridge PRG ROM
● only accessible by the CPU

Cartridge CHR ROM
● only accessible by the PPU

Nintendo Famicom: Cartridge Overview

● Multiple variations exist
  ○ Metal Slader Glory

● Sophistication generally correlates with release year

● WRAM is used for save games

Metal Slader Glory is the largest officially licensed NES game ever created. It required a whopping 1 MB of storage, split between a 512 KB PRG ROM and a 512 KB CHR ROM.

System Coordination

mimic the architecture

Our Emulator Design

[Figure: emulator design with four components: CPU, Input, PPU, Game Cartridge]

Memory Layout

Memory Overview

● CPU has access to:
  ○ Typically 32 KB.

● PPU has access to:
  ○ Typically 8 KB.

● CPU and PPU both:

easily fit all system and cartridge data in host memory

extremely simple

*The NES does support , but we won’t need it for our purposes.

Memory Map — CPU

[Legend: PPU, RAM, CART]

    0x8000 - 0xFFFF   Cartridge PRG ROM (32 KB)
    0x6000 - 0x7FFF   Cartridge WRAM (8 KB)
    0x4000 - 0x5FFF   APU and Controller registers
    0x2000 - 0x3FFF   PPU Registers (8 KB) (8 mirrored registers)
    0x1800 - 0x1FFF   Mirror of CPU RAM
    0x1000 - 0x17FF   Mirror of CPU RAM
    0x0800 - 0x0FFF   Mirror of CPU RAM (6 KB of mirrors in total)
    0x0000 - 0x07FF   CPU RAM (2 KB)

● 16-bit addresses, 8 bit word size
● 64K total address range, but only ~50KB usable memory due to address mirroring
● We can easily fit all of this in RAM on a modern system!

Memory Map — CPU

The three regions that matter most:

● Game code: 32 KB Cartridge PRG ROM
● How the CPU talks to the PPU: 8 KB PPU Registers (8 mirrored registers)

● CPU working memory: 2 KB CPU RAM

uint8 read_cpu_byte(uint16 address)

SNIPPET 1 — READ CPU MEMORY

    uint8 system_bus::read_cpu_byte(uint16 address) {
        if (address >= 0x8000) {
            // Read from PRG ROM (on the cartridge).
            return game_cart->program_rom[address - 0x8000];
        } else if (address >= 0x6000) {
            // Read from WRAM (on the cartridge).
            return game_cart->save_ram[address - 0x6000];
        } else if (address == 0x4016 || address == 0x4017) {
            // Read controller state, described as an 8 bit value at one of
            // two addresses (registers).
            uint8 controller_idx = address - 0x4016;
            return keypads[controller_idx]->read();
        } else if (address >= 0x4000) {
            // Remaining APU and I/O registers are not emulated here. Without
            // this branch they would fall through to the PPU registers.
            return 0;
        } else if (address >= 0x2000) {
            // Read PPU register. There are 8 of them, at 0x2000 to 0x2007,
            // and then repeated to 0x3FFF.
            return ppu->read_ppu_register((address - 0x2000) & 0x7);
        } else {
            // Read from 2KB CPU RAM. Mirrored after 0x7FF, so we always read
            // from the lowest 2KB.
            return system_ram[address & 0x7FF];
        }
    }

Memory Map — PPU

[Legend: PPU, RAM, CART]

    0x3F20 - 0x3FFF   Mirror of RAM
    0x3F00 - 0x3F1F   Palette RAM (32 B)
    0x3800 - 0x3EFF   Mirror of PPU RAM
    0x3000 - 0x37FF   Mirror of PPU RAM
    0x2800 - 0x2FFF   Mirror of PPU RAM (~6 KB of mirrors in total)
    0x2000 - 0x27FF   PPU RAM (2 KB)
    0x0000 - 0x1FFF   CHR ROM (8 KB)

    0x00 - 0xFF       Object Attributes (256 B)

● 16-bit addresses, 8 bit word size
● 16K total address range, but only ~10KB usable memory due to address mirroring
● We can easily fit all of this in RAM on a modern system!

Memory Map — PPU

● Colors currently in use: 32 B Palette RAM
● Background framebuffer: 2 KB PPU RAM
● Background & sprite tiles: 8 KB CHR ROM
● Sprite data (position, flips, etc.): 256 B Object Attributes

We’ll come back to PPU memory in a bit and show how the CPU and PPU manage their data to present a layered and animated world.

Don’t worry if this sounds a bit abstract for now!

uint8 read_ppu_byte(uint16 address)

SNIPPET 2 — READ PPU MEMORY

    uint8 system_bus::read_ppu_byte(uint16 address) {
        if (address >= 0x3F00) {
            // Read palette (32B, mirrored).
            return palette_ram[(address - 0x3F00) & 0x1F];
        } else if (address >= 0x2000) {
            // Read from VRAM (2KB). Vertical and horizontal modes affect the
            // way that we organize data in this RAM. More on this in a bit.
            if (game_cart->header.mirror_mode) {
                address &= 0x7FF;
            } else {
                // Fold bit 11 into bit 10 so that both halves of the 2KB are
                // reachable (masking with 0xBFF and then 0x7FF would drop it).
                address = (address & 0x3FF) | ((address & 0x800) >> 1);
            }
            return video_ram[address & 0x7FF];
        } else {
            // Read from CHR ROM (8KB).
            return game_cart->tile_rom[address];
        }
    }

uint8 read_cpu_byte(uint16 address)
uint8 read_ppu_byte(uint16 address)

Whenever we detect the CPU or PPU attempting to read memory, we’ll call read_cpu_byte or read_ppu_byte instead. We’ll also need complementary write methods, as well as the ability to read and write 16 bit values with the CPU (each of which is implemented as two consecutive 8 bit memory operations).

Memory Interface
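As a minimal sketch of the 16 bit accessors, assuming a flat 64 KB array in place of the real address decoding (the flat_bus type is hypothetical and only for illustration), each 16 bit operation becomes two byte operations with the low byte first:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the CPU side of the system bus: a flat 64 KB
// array instead of the address decoding done by read_cpu_byte.
struct flat_bus {
    std::array<uint8_t, 0x10000> memory{};

    uint8_t read_cpu_byte(uint16_t address) const { return memory[address]; }
    void write_cpu_byte(uint16_t address, uint8_t input) { memory[address] = input; }

    // The 6502 is little endian: low byte at address, high byte at address + 1.
    uint16_t read_cpu_short(uint16_t address) const {
        uint8_t low = read_cpu_byte(address);
        uint8_t high = read_cpu_byte(static_cast<uint16_t>(address + 1));
        return static_cast<uint16_t>(low | (high << 8));
    }

    void write_cpu_short(uint16_t address, uint16_t input) {
        write_cpu_byte(address, static_cast<uint8_t>(input & 0xFF));
        write_cpu_byte(static_cast<uint16_t>(address + 1),
                       static_cast<uint8_t>(input >> 8));
    }
};
```

Writing 0x4400 at 0xC001 with this sketch stores 0x00 at 0xC001 and 0x44 at 0xC002, which is the byte layout used by the JMP example later in the talk.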

    uint8 read_cpu_byte(uint16 address)
    uint8 read_ppu_byte(uint16 address)

    void write_cpu_byte(uint16 address, uint8 input)
    void write_ppu_byte(uint16 address, uint8 input)

    uint16 read_cpu_short(uint16 address)
    void write_cpu_short(uint16 address, uint16 input)

CPU Emulation

Emulating the CPU: Inside the 6502

● 8-bit processor with a 16-bit address bus

● 6 Registers

● 53 opcodes

6502 Processor

This popular microprocessor design was used in multiple systems including the NES, Atari 2600, Apple IIe, and the Commodore 64.

It was a relatively inexpensive yet versatile processor, and was considered to be more developer friendly than other available options (e.g. the Z80) at the time.

Program Flow

[Figure: address space, 8 bits wide, from 0x0000 (CPU RAM, 2 KB) up to 0xFFFF (PRG ROM, 32 KB)]

Program Counter (PC): PC 0xC000

Stack Pointer (SP): SP 0x00FD

Status Register (SR)
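The status flags in this section can be modelled as bit masks over a single byte. This is only a sketch: it assumes the standard 6502 layout in which bit 5 is unused, and the names status_flag, flag_is_set and with_flag are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bit masks for the 6502 status register, most significant bit first.
// Assumption: standard 6502 layout, where bit 5 (0x20) is unused.
enum status_flag : uint8_t {
    FLAG_NEGATIVE  = 0x80,  // N
    FLAG_OVERFLOW  = 0x40,  // V
    FLAG_BREAK     = 0x10,  // B
    FLAG_DECIMAL   = 0x08,  // D
    FLAG_INTERRUPT = 0x04,  // I
    FLAG_ZERO      = 0x02,  // Z
    FLAG_CARRY     = 0x01   // C
};

// True when the given flag is set in the status register value.
inline bool flag_is_set(uint8_t sr, status_flag flag) {
    return (sr & flag) != 0;
}

// Returns a copy of the status register with one flag set or cleared.
inline uint8_t with_flag(uint8_t sr, status_flag flag, bool enabled) {
    return enabled ? static_cast<uint8_t>(sr | flag)
                   : static_cast<uint8_t>(sr & ~flag);
}
```

Under this layout, the value SR: 0x04 that appears in the ADC walkthrough means that only the I flag is set.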

N V B D I Z C

N  set if arithmetic operation produced a negative value
V  set if arithmetic operation resulted in overflow or underflow
B  break interrupt received (may indicate a debug stop or reset)
D  decimal mode, not supported by the NES
I  set to disable maskable interrupts
Z  set if an operation produced a zero value
C  set if an operation produced a carry or borrow (includes shifts!)

FETCH, DECODE, EXECUTE, REPEAT

Opcode Lifecycle

1. Fetch:

2. Decode:

3. Execute:

SNIPPET 3 — CPU CYCLE

    void virtual_cpu::cycle() {
        handle_interrupt();
        uint8 opcode = memory_bus->read_opcode(registers.pc);  /* fetch */
        uint16 operand_address = decode_opcode(opcode);        /* decode */
        execute(opcode, operand_address);                      /* execute */
        registers.pc += op_length_table[opcode];
    }

ARITHMETIC & LOGIC (+, –, and, or, compare, inc, dec, rotate)

ADC, AND, ASL, CMP, CPX, CPY, DEC, DEX, DEY, EOR, INC, INX, INY, LSR, ORA, ROL, ROR, SBC, ACC_ASL, ACC_LSR, ACC_ROL, ACC_ROR

CONTROL FLOW (branch, conditional branch, return)

BCC, BCS, BEQ, BMI, BNE, BPL, BVC, BVS, JMP, JSR, RTI, RTS

TRANSFER (load from memory, store to memory, transfer between registers)

LDA, LDX, LDY, STA, STX, STY, TAX, TAY, TSX, TXA, TXS, TYA, PHA, PHP, PLA, PLP

INTERRUPT (halt execution and execute handler), STATUS

INT, BIT, BRK

(Hey, where are MUL and DIV?!)

Where’s the MUL? DIV?!
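The 6502 has no multiply or divide instruction. When a product really is needed, one well-known option is to build it from shifts and adds, which the instruction set does provide. The sketch below is a general technique rather than something taken from this talk, and the function name multiply_shift_add is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Multiplies two 8-bit values using only shifts and adds. For every set bit
// in b, the correspondingly shifted copy of a is added to the result.
uint16_t multiply_shift_add(uint8_t a, uint8_t b) {
    uint16_t result = 0;
    uint16_t shifted = a;
    while (b != 0) {
        if (b & 1) {
            result = static_cast<uint16_t>(result + shifted);
        }
        shifted = static_cast<uint16_t>(shifted << 1);
        b >>= 1;
    }
    return result;
}
```

The loop runs at most eight times, once per bit of b, so its cost is bounded and easy to budget for.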

Developers were often able to avoid the need for MUL and DIV simply by restructuring their logic.

Let’s walk through an example: ADC

    ADC operand // A = A + operand (plus the carry bit)

All opcodes are 1 byte long. Most operands are also 1 byte long.

CPU Cycle for ADC Instruction

Memory at PC:

    0xC002
    0xC001   01000010   (operand)
    0xC000   01101001   (opcode)

CPU pipeline state and register state at each step:

    Step      Opcode   Operand 1   Operand 2   PC       SR     A
    Initial   0x00     0x00        0x00        0xC000   0x04   0x00
    Step 1    0x69     0x00        0x00        0xC001   0x04   0x00
    Step 2    0x69     0x42        0x00        0xC002   0x04   0x00

Step 1: (a) read the opcode at PC into the pipeline, (b) advance PC.

Step 2: (a) read the operand at PC into the pipeline, (b) advance PC.

Step 3: execute the opcode.

CPU Cycle for ADC Instruction

Step 3

Register state before execution: PC: 0xC002, SR: 0x04, A: 0x00

SNIPPET 3 — EXECUTE ADC OPCODE

    // SAME_SIGN(a, b) is assumed to compare the sign bit (bit 7) of its arguments.
    void _execute_opcode_adc(uint8 operand) {
        // result = A + 0x42 (plus the carry bit)
        uint16 result = (uint16) registers.a + operand + status_reg.carry_bit;
        // SR updated to indicate interesting status about the operation.
        status_reg.carry_bit = !!((result) & 0xFF00);
        status_reg.negative_bit = !!((result) & 0x80);
        status_reg.zero_bit = !((result) & 0xFF);
        status_reg.overflow_bit = SAME_SIGN(registers.a, operand) && !SAME_SIGN(operand, result);
        // Actually store the value in register A.
        registers.a = result & 0xFF;
    }

Register state after execution: PC: 0xC002, SR: 0x04, A: 0x42

CPU Addressing Modes

addressing modes

Memory Map — CPU (Revisited)

16-bit addresses (64K range), 8 bit word size (64 KB total size)

    32 KB   Cartridge PRG ROM
    16 KB
    8 KB    PPU Registers
    2 KB    Mirror of CPU RAM
    2 KB    Mirror of CPU RAM
    2 KB    Mirror of CPU RAM
    2 KB    CPU RAM

A 16-bit address such as 0x4400 splits into two bytes:

● High byte: page index (ranges 0 to 255)
● Low byte: byte index in page (ranges 0 to 255)

The first page of our memory map is the zero page.

6502 Memory Access

OPERAND ADDRESSING MODES

Operand addressing modes specify a variety of different ways that we can store and compute a memory address for use in fetching an operand value.

This is one of the most confusing parts about 6502 emulation!
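As a rough sketch of what decoding has to do, the function below resolves the address that holds the operand for each of the memory-based modes covered in this section. Everything here is illustrative: cpu_state, addr_mode and operand_address are hypothetical names, memory is a flat 64 KB array, and zero page arithmetic is assumed to wrap within page 0. Immediate mode is left out because its operand is the value itself, and the indirect case ignores the page boundary hazard.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

enum class addr_mode {
    absolute, absolute_x, absolute_y,
    zero_page, zero_page_x, zero_page_y,
    indirect, indirect_x, indirect_y
};

struct cpu_state {
    std::array<uint8_t, 0x10000> memory{};
    uint8_t x = 0;
    uint8_t y = 0;

    // Little endian 16-bit read from anywhere in memory.
    uint16_t read_short(uint16_t address) const {
        return static_cast<uint16_t>(
            memory[address] | (memory[static_cast<uint16_t>(address + 1)] << 8));
    }

    // Assumption: a pointer stored in the zero page wraps inside page 0.
    uint16_t read_zero_page_short(uint8_t address) const {
        return static_cast<uint16_t>(
            memory[address] | (memory[static_cast<uint8_t>(address + 1)] << 8));
    }

    // Returns the address the operand value should be fetched from.
    uint16_t operand_address(addr_mode mode, uint16_t base) const {
        switch (mode) {
        case addr_mode::absolute:    return base;
        case addr_mode::absolute_x:  return static_cast<uint16_t>(base + x);
        case addr_mode::absolute_y:  return static_cast<uint16_t>(base + y);
        case addr_mode::zero_page:   return static_cast<uint8_t>(base);
        case addr_mode::zero_page_x: return static_cast<uint8_t>(base + x);
        case addr_mode::zero_page_y: return static_cast<uint8_t>(base + y);
        case addr_mode::indirect:    return read_short(base);
        case addr_mode::indirect_x:
            return read_zero_page_short(static_cast<uint8_t>(base + x));
        case addr_mode::indirect_y:
            return static_cast<uint16_t>(
                read_zero_page_short(static_cast<uint8_t>(base)) + y);
        }
        return 0;
    }
};
```

Once the address is known, the operand itself is a single byte read from that address.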

(I apologize for the next slide!)

    Mode type    Operand syntax       Operation
    Absolute     ADC @(0x4400)        Operand is the value located at address 0x4400
    Absolute     ADC @(0x4400+X)      Operand is the value located at address 0x4400+X
    Absolute     ADC @(0x4400+Y)      Operand is the value located at address 0x4400+Y
    Immediate    ADC 0x42             Operand is the value 0x42
    Zero Page    ADC @(0x44)          Operand is the value located at address 0x0044 (in the zero page)
    Zero Page    ADC @(0x44+X)        Operand is the value located at address 0x0044+X (in the zero page)
    Zero Page    ADC @(0x44+Y)        Operand is the value located at address 0x0044+Y (in the zero page)
    Indirect     ADC @@(0x4400)       Fetch an address from 0x4400, use it to fetch operand.
    Indirect     ADC @@(0x44+X)       Fetch an address from 0x0044+X, use it to fetch operand.
    Indirect     ADC @(@(0x44)+Y)     Fetch an address from 0x0044, add Y, use result to fetch operand.

JMP Instruction

JMP has an absolute addressing version and an indirect addressing version.

CPU Cycle for JMP (Absolute Addressing)

    JMP 0x4400 // set PC to 0x4400

Memory at PC:

    0xC002   01000100
    0xC001   00000000
    0xC000   01001100   (opcode)

Register and pipeline state at each step:

    Step      Opcode   Cache    PC
    Initial   0x00     0x0000   0xC000
    Step 1    0x4C     0x0000   0xC001
    Step 2    0x4C     0x4400   0xC003
    Step 3             0x4400   0x4400

Step 1: (a) read the opcode at PC into the pipeline, (b) advance PC.

Step 2: (a) read the two address bytes into the cache, (b) advance PC past them.

Step 3: execute the opcode.

SNIPPET 4 — EXECUTE JMP OPCODE

    void _execute_opcode_jmp(uint16 cached_address) {
        // Set PC to our target address. Next CPU cycle will begin execution
        // at this address.
        registers.pc = cached_address;
    }

CPU Cycle for JMP (Indirect Addressing)

JMP @(0x4400) // set PC to the address stored at 0x4400

Hazard: 6502 Indirect Address Bug

JMP: you may need to replicate buggy platform behavior
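On the 6502, an indirect JMP whose pointer sits on the last byte of a page fetches the high byte of the target from the start of that same page instead of from the next page. A minimal sketch of how an emulator might reproduce this, assuming a flat 64 KB memory array and a hypothetical helper named read_jmp_indirect_target:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Reads the 16-bit JMP target stored at `pointer`, reproducing the hardware
// quirk: the address of the high byte keeps the pointer's page and only the
// low 8 bits of the address are incremented, so 0x00FF wraps to 0x0000.
uint16_t read_jmp_indirect_target(const std::array<uint8_t, 0x10000>& memory,
                                  uint16_t pointer) {
    uint16_t high_address = static_cast<uint16_t>(
        (pointer & 0xFF00) | ((pointer + 1) & 0x00FF));
    uint8_t low = memory[pointer];
    uint8_t high = memory[high_address];
    return static_cast<uint16_t>(low | (high << 8));
}
```

With 0x4A stored at 0x00FF, 0x00 at 0x0100 and 0x22 at 0x0000, this returns 0x224A rather than 0x004A. Pointers that do not sit on a page boundary behave as expected.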

[Figure: an address stored across the boundary between page 0 (0x0000 to 0x00FF) and page 1 (0x0100 to 0x01FF)]

    0x0100   00000000   expected high byte fetch (page 1)
    0x00FF   01000110   low byte fetch (page 0)
    0x0000   00100010   erroneous high byte fetch (page 0)

The low byte is fetched from 0x00FF, so we would expect the high byte to come from 0x0100 and our loaded address to be 0x004A. Instead, the high byte fetch wraps around to the start of page 0.

Thus we expected a value of 0x004A, but received a value of 0x224A.

Little Endian vs. Big Endian

Another important detail

This is known as little endian order.

big endian order

    Little endian          X86, X64 (i.e. Intel, AMD, etc.), 6502
    Big endian             Power PC, 68000, several game consoles
    Both (configurable)    ARM, MIPS

(Homework: think of the pros/cons to each ordering method)

Achievement Unlocked: CPU Emulation

interrupts, cycle count

PPU Emulation

Emulating the PPU: Inside the 2C02

● Innovative memory architecture

● Supported detailed graphics

heavy reuse of low fidelity graphics data

Memory Efficiency

0.5 KB

Principles of PPU Emulation

five key concepts

Background vs. Foreground

[Figure: background + foreground = final frame]

Background: where, how, colors

Foreground

PPU Operation

frame buffer, nametable

our frame is all black by default

PPU Memory Map

    0x3F20 - 0x3FFF
    0x3F00 - 0x3F1F   Palette RAM (32 B)
    0x3000 - 0x3EFF
    0x2000 - 0x2FFF   PPU RAM (4 KB) (2 KB mirrored)
        0x2C00 - 0x2FFF   Nametable 3 (1 KB)
        0x2800 - 0x2BFF   Nametable 2 (1 KB)
        0x2400 - 0x27FF   Nametable 1 (1 KB)
        0x2000 - 0x23FF   Nametable 0 (1 KB)
    0x0000 - 0x1FFF   CHR ROM (8 KB)

    0x00 - 0xFF       Object Attributes (256 B)

Where is our frame buffer stored?

Our frame buffer is here, in the low 1 KB of our PPU RAM (Nametable 0).

We’ll talk about the additional nametables later, when we discuss scrolling.

PPU Memory Map

[Figure: a 256x240 frame divided into 8x8 blocks]

The frame buffer holds references to tiles. Each 8x8 block of the 256x240 frame is represented by one tile reference in our framebuffer.

32x30 tiles = 960 tile references = 960 bytes total

Tile Based Rendering
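The 32x30 arithmetic can be sketched as a lookup from a screen pixel to the PPU address of its tile reference. This assumes the references are stored row by row starting at nametable 0 (0x2000); the constants and the function name tile_reference_address are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr int TILE_SIZE = 8;
constexpr int TILES_PER_ROW = 256 / TILE_SIZE;  // 32 tiles across
constexpr int TILE_ROWS = 240 / TILE_SIZE;      // 30 tiles down
constexpr uint16_t NAMETABLE_0 = 0x2000;

// PPU address of the tile reference that covers screen pixel (x, y).
// Assumption: references are laid out row by row, one byte per tile.
uint16_t tile_reference_address(int x, int y) {
    int column = x / TILE_SIZE;
    int row = y / TILE_SIZE;
    return static_cast<uint16_t>(NAMETABLE_0 + row * TILES_PER_ROW + column);
}
```

The last pixel of the frame, (255, 239), falls in tile 959, so the references occupy 0x2000 through 0x23BF.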

exactly

Tile Based Rendering

tiles, which are reusable image patterns

● Background tile patterns

● Foreground sprite patterns

PPU Memory Map

Each 8x8 tile requires 16 bytes of memory

    0x3F20 - 0x3FFF
    0x3F00 - 0x3F1F
    0x3000 - 0x3EFF
    0x2000 - 0x2FFF   PPU RAM (4KB) (2KB mirrored)
    0x0000 - 0x1FFF   CHR ROM (8KB)

games typically supported up to 256 different background tile patterns

Anatomy of a CHR tile

[Figure: output 256x240 image pixels, next to output 8x8 image pixels]

    00 00 00 00 00 00 00 11
    00 00 00 00 00 00 11 10
    00 00 00 00 00 11 10 10
    00 00 00 00 11 10 10 10
    00 00 00 11 10 10 10 10
    00 00 11 10 10 10 10 10
    00 11 10 10 10 10 10 10
    11 10 10 10 10 10 10 10

The frame buffer tile reference (00010000) selects the referenced CHR tile from CHR ROM, shown above.

Anatomy of a CHR tile

Each 2-bit value in the tile does not directly hold a color value; it holds an index (00, 01, 10, or 11) into a color palette.

Anatomy of a CHR tile
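A minimal sketch of turning the 16 bytes of a tile into the 8x8 grid of 2-bit palette indices. The byte layout used here is an assumption based on common NES documentation rather than on this talk: bytes 0 to 7 hold the low bit of each row, bytes 8 to 15 hold the high bit, and the leftmost pixel is the most significant bit. The names tile_pixels and decode_chr_tile are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using tile_pixels = std::array<std::array<uint8_t, 8>, 8>;

// Decodes one 16-byte CHR tile into 8x8 two-bit palette indices.
// Assumption: two bit planes of 8 bytes each, low plane first.
tile_pixels decode_chr_tile(const uint8_t* tile) {
    tile_pixels pixels{};
    for (int row = 0; row < 8; ++row) {
        uint8_t low_plane = tile[row];
        uint8_t high_plane = tile[row + 8];
        for (int col = 0; col < 8; ++col) {
            int bit = 7 - col;  // leftmost pixel is the most significant bit
            uint8_t low = (low_plane >> bit) & 1;
            uint8_t high = (high_plane >> bit) & 1;
            pixels[row][col] = static_cast<uint8_t>((high << 1) | low);
        }
    }
    return pixels;
}
```

Each decoded value is still only an index; the final color comes from looking that index up in the palette that is active for the tile.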

[Figure: the same tile shown twice, once as palette indices and once with its colors applied]

64 colors in total

Notice that this color space wastes 10 values for black, lacks a true yellow, and has a fairly poor (making transitions and fades difficult)

64 colors in total

00 01 10 11

Further palette limitations:

common background, transparent

PPU Palette Storage

    0x3F20 - 0x3FFF
    0x3F00 - 0x3F1F   Palette RAM (32 B)
    0x3000 - 0x3EFF
    0x2000 - 0x2FFF   PPU RAM (4 KB) (2 KB mirrored)
    0x0000 - 0x1FFF   CHR ROM (8 KB)

    0x00 - 0xFF       Object Attributes (256 B)

PPU Palette Restrictions

One final palette limitation: meta-tile

[Figure: meta-tile boundary compared with tile boundary]

Tile Based Rendering

30 tiles used in the frame, requiring 480 bytes in the CHR ROM

● 22 background tiles (352 bytes)
● 8 sprite tiles (128 bytes)

Tile Based Rendering

22 unique tiles used in the frame, requiring 352 bytes in the CHR ROM!

● 15 background tiles (240 bytes), down from 22
● 7 sprite tiles (112 bytes), down from 8

Anatomy of a Sprite

Sprites are actually fairly similar to background tiles, but with a few exceptions:

● scanline
● attributes

four separate 8x8 sprites

Anatomy of a Sprite

Sprites are 4 bytes each, so we can store up to 64 concurrent sprites in our 256 byte object attribute memory.

    0x3F20 - 0x3FFF
    0x3F00 - 0x3F1F   Palette RAM (32 B)
    0x3000 - 0x3EFF
    0x2000 - 0x2FFF   PPU RAM (4 KB) (2 KB mirrored)
    0x0000 - 0x1FFF   CHR ROM (8 KB)

    0x00 - 0xFF       Object Attributes (256 B)

Anatomy of a Sprite

    typedef struct sprite_desc {
        uint8 sprite_y;        // sprite y pixel coordinate
        uint8 tile_reference;  // indicates tile pattern to use for rendering
        uint8 attributes;      // attributes (more on this next)
        uint8 sprite_x;        // sprite x pixel coordinate
    } sprite_desc;

Anatomy of a Sprite

● ● ● Anatomy of a Sprite

V H M U U U P P

P  palette select, uses two bits to select one of four 4-color sprite palettes
U  unused (never read, hopefully never written)
M  background mask, indicates if sprite is behind the background
H  flip sprite horizontally
V  flip sprite vertically

Anatomy of a Sprite
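The attribute byte can be unpacked with a few masks. A minimal sketch, reading the V H M U U U P P layout with V as bit 7; the struct and function names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

struct sprite_attributes {
    uint8_t palette;         // P: one of four 4-color sprite palettes (0-3)
    bool behind_background;  // M: sprite is drawn behind the background
    bool flip_horizontal;    // H
    bool flip_vertical;      // V
};

// Unpacks the V H M U U U P P attribute byte (V is bit 7, P is bits 1-0).
sprite_attributes decode_sprite_attributes(uint8_t attributes) {
    sprite_attributes decoded;
    decoded.palette = attributes & 0x03;
    decoded.behind_background = (attributes & 0x20) != 0;
    decoded.flip_horizontal = (attributes & 0x40) != 0;
    decoded.flip_vertical = (attributes & 0x80) != 0;
    return decoded;
}
```

For example, an attribute byte of 0xC2 selects sprite palette 2 and flips the tile both horizontally and vertically.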

tile reuse

Anatomy of a Sprite

even more tile reuse

PPU Sprite Layering

object attribute memory secondary OAM
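A minimal C++ sketch of the per-scanline sprite selection that this section describes. It makes several simplifying assumptions: sprites are 8 pixels tall, sprite_y is treated as the top row of the sprite (real hardware has an offset that is ignored here), and at most 8 sprites are kept per scanline. The function name sprite_draw_order is hypothetical, and the result is a list of primary OAM indices rather than rendered pixels.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

struct sprite_desc {
    uint8_t sprite_y;
    uint8_t tile_reference;
    uint8_t attributes;
    uint8_t sprite_x;
};

constexpr int SPRITE_HEIGHT = 8;
constexpr int MAX_SPRITES_PER_SCANLINE = 8;

// Returns primary OAM indices in the order they should be drawn on this
// scanline. Sprites are selected by scanning 0 -> 63, then the selection is
// reversed so that lower indices are drawn last and end up on top.
std::vector<int> sprite_draw_order(const std::array<sprite_desc, 64>& primary_oam,
                                   int scanline) {
    std::vector<int> secondary_oam;
    for (int i = 0; i < 64; ++i) {
        if (static_cast<int>(secondary_oam.size()) >= MAX_SPRITES_PER_SCANLINE) {
            break;  // the per-scanline limit has been reached
        }
        int top = primary_oam[i].sprite_y;
        if (scanline >= top && scanline < top + SPRITE_HEIGHT) {
            secondary_oam.push_back(i);
        }
    }
    return std::vector<int>(secondary_oam.rbegin(), secondary_oam.rend());
}
```

Because selection stops once the limit is reached, a ninth sprite on the same scanline is simply dropped, which is why reordering sprites between frames changes which one goes missing.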

    for each scanline:
        clear secondary OAM
        scan the sprites in the primary OAM (0 -> 63):
            if a sprite intersects the current scanline, write it into the secondary OAM
        scan the sprites in secondary OAM in reverse order (# sprites - 1 -> 0):
            render the sprite into the current scanline

Megaman II (and others) “worked around” the 8 sprite per scanline limitation by reordering sprites each frame. This resulted in noticeable sprite flicker, but enabled some incredibly dynamic gameplay.

Background Scrolling

smooth scrolling backgrounds Smooth Scrolling on a PC

The first widely known demo of a smooth scrolling background on the PC was Dangerous Dave in Copyright Infringement (1990), which recreated a classic level from Super Mario Bros. 3.

Smooth scrolling was a significant competitive advantage for the NES.

It was created by John Carmack and Tom Hall, who would go on to co-found the game studio id Software (whose credits include the Commander Keen, DOOM, and Quake series).

Background Scrolling

● No scrolling: all of the action happens on a single screen.
● Horizontal scrolling: background pans horizontally to offer exploration of a larger world.
● Vertical scrolling: background pans vertically to offer exploration of a larger world.
● Full scrolling (horizontal + vertical): background pans vertically and horizontally to offer exploration of a larger world.

Nametable scrolling

0x2800 - 0x2BFF    0x2C00 - 0x2FFF    Mirrors

0x2000 - 0x23FF    0x2400 - 0x27FF    Our 2 KB RAM

Nametable scrolling
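One way to model the mirrored nametable region is to fold every address in 0x2000 - 0x2FFF onto the 2 KB of real RAM. The sketch below assumes the arrangement shown in these slides, where 0x2800 mirrors 0x2000 and 0x2C00 mirrors 0x2400; other cartridge wirings mirror differently. The function and array names are hypothetical.

```c
#include <stdint.h>

/* The 2 KB of physical nametable RAM (nametable 0 and nametable 1). */
static uint8_t nametable_ram[2048];

/* Folds a PPU address in 0x2000-0x2FFF onto the 2 KB RAM, assuming
   0x2800 mirrors 0x2000 and 0x2C00 mirrors 0x2400. */
uint16_t nametable_ram_offset(uint16_t ppu_addr) {
    return (uint16_t)(ppu_addr & 0x07FF);
}

uint8_t ppu_read_nametable(uint16_t ppu_addr) {
    return nametable_ram[nametable_ram_offset(ppu_addr)];
}

void ppu_write_nametable(uint16_t ppu_addr, uint8_t value) {
    nametable_ram[nametable_ram_offset(ppu_addr)] = value;
}
```

A write through either address is visible through the other, so updating nametable 0 also updates its mirror with no extra work.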

Nametable layout used in the scrolling walkthrough:

0x2000         0x2400         0x2800                0x2C00
nametable 0    nametable 1    nametable 0 mirror

The SCREEN window slides across this layout as Scroll X advances:

Scroll X: 0x2000. The screen shows nametable 0.
Scroll X: 0x2100. Part of nametable 0 is no longer visible, so we update it with new tiles (and this is reflected in our mirror).
Scroll X: 0x2300. More of nametable 0 is no longer visible, so we update it with new tiles (and this is reflected in our mirror).
Scroll X: 0x2400. nametable 0 now fully reset, ready to present fresh background.
Scroll X: 0x2500. Leveraging address mirroring for smooth nametable traversal.
Scroll X: 0x2700. Leveraging address mirroring for smooth nametable traversal.
Scroll X: 0x2800. nametable 1 fully updated with fresh background.
Scroll X: 0x2000. Reset scroll x to 0x2000 once we complete a scroll cycle.

Concept Review
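The nametable scrolling walkthrough can also be expressed as a lookup from a screen position to a nametable address. The slides label scroll positions with addresses; this sketch instead assumes a pixel-based `scroll_x` in the range 0 to 511 (two 256-pixel-wide nametables side by side) and the standard 32-tiles-per-row nametable layout. `tile_address` is a hypothetical helper.

```c
#include <stdint.h>

/* Returns the nametable address of the tile under a screen pixel for a
   horizontally scrolling background. Assumes nametable 0 at 0x2000 and
   nametable 1 at 0x2400, and wraps after 512 pixels. */
uint16_t tile_address(uint16_t scroll_x, uint8_t screen_x, uint8_t screen_y) {
    uint16_t world_x   = (uint16_t)((scroll_x + screen_x) % 512);
    uint16_t nametable = world_x / 256;       /* 0 or 1 */
    uint16_t col       = (world_x % 256) / 8; /* tile column, 0-31 */
    uint16_t row       = screen_y / 8;        /* tile row */
    return (uint16_t)(0x2000 + nametable * 0x400 + row * 32 + col);
}
```

The modulo wrap plays the role of the "reset scroll x" step: once the screen has traversed both nametables, the lookup returns to nametable 0.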

Five key concepts

✔ ✔ ✔ ✔ ✔ Input Basic Controller Support

● 0x4016 0x4017

Bit:     7      6     5     4   3      2       1  0
Button:  Right  Left  Down  Up  Start  Select  B  A

Part III
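A minimal sketch of controller emulation for the port at 0x4016, using the button bit layout above (bit 0 = A through bit 7 = Right). It assumes the commonly documented strobe protocol, which the surviving slide text does not spell out: the game writes 1 then 0 to latch the buttons, then reads one button per access. The `controller` struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct controller {
    uint8_t buttons; /* live state: bit 0 = A, 1 = B, 2 = Select, 3 = Start,
                        4 = Up, 5 = Down, 6 = Left, 7 = Right */
    uint8_t shift;   /* latched copy, shifted out one bit per read */
    uint8_t strobe;  /* 1 while the game holds the strobe bit high */
} controller;

/* Handles a CPU write to the controller port. */
void controller_write(controller *c, uint8_t value) {
    c->strobe = value & 1;
    if (c->strobe) c->shift = c->buttons; /* latch current button state */
}

/* Handles a CPU read from the controller port; returns one button bit. */
uint8_t controller_read(controller *c) {
    if (c->strobe) return c->buttons & 1; /* strobe high: always reports A */
    uint8_t bit = c->shift & 1;
    /* Shift in 1s so reads past the eighth report 1 (assumed here to
       match typical standard controller behavior). */
    c->shift = (uint8_t)((c->shift >> 1) | 0x80);
    return bit;
}
```

The host side only needs to set `buttons` from keyboard or gamepad input each frame; the emulated game then pulls the bits out in A, B, Select, Start, Up, Down, Left, Right order.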

Advanced Topics Security

Why Security Matters (to the platform manufacturer)

Integrity of the platform ● ●

Integrity of the business ● ○ ○

Terminology: ● → ● → Security Model Goals

Prevent users from running arbitrary code ⇒

Prevent users from running duplicated code ⇒ Platform Security 101


Strategies:

● Platform verification → ○

● Application verification →

Common tactics:

These are particularly challenging for emulation

know that their security model will be breached

Optimizations

Just-in-time Compilation

With this design, each 6502 opcode will require many host opcodes

Instead, we can convert 6502 opcodes into host opcodes and feed them directly into the host processor.

Just-in-time Compilation
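A real just-in-time compiler emits host machine code, which is platform specific. As a portable stand-in, the sketch below uses pre-decoded "threaded code": a block of 6502 bytes is translated once into an array of host function pointers, which can then be run repeatedly without re-decoding. It handles only three opcodes (LDA immediate 0xA9, TAX 0xAA, INX 0xE8) and ignores status flags; all type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct cpu { uint8_t a, x; } cpu;
typedef void (*host_op)(cpu *c, uint8_t operand);
typedef struct translated_op { host_op fn; uint8_t operand; } translated_op;

static void op_lda_imm(cpu *c, uint8_t v) { c->a = v; }
static void op_tax(cpu *c, uint8_t v) { (void)v; c->x = c->a; }
static void op_inx(cpu *c, uint8_t v) { (void)v; c->x++; }

/* Translates 6502 bytes into host operations until an unsupported opcode,
   a truncated operand, or the output capacity is reached.
   Returns the number of operations produced. */
size_t translate_block(const uint8_t *code, size_t len,
                       translated_op *out, size_t cap) {
    size_t pc = 0, n = 0;
    while (pc < len && n < cap) {
        switch (code[pc]) {
        case 0xA9: /* LDA #imm */
            if (pc + 1 >= len) return n;
            out[n].fn = op_lda_imm; out[n].operand = code[pc + 1];
            n++; pc += 2; break;
        case 0xAA: /* TAX */
            out[n].fn = op_tax; out[n].operand = 0; n++; pc += 1; break;
        case 0xE8: /* INX */
            out[n].fn = op_inx; out[n].operand = 0; n++; pc += 1; break;
        default:
            return n; /* end of block: fall back to another strategy */
        }
    }
    return n;
}

/* Executes a previously translated block. */
void run_block(cpu *c, const translated_op *ops, size_t n) {
    for (size_t i = 0; i < n; i++) ops[i].fn(c, ops[i].operand);
}
```

The decode cost is paid once per block instead of once per executed instruction, which is the same trade a true JIT makes, only with less dramatic gains.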

● Just in time ○ ○

● Ahead of time ○ ○

Caveat: Graphics

● ○ ○ ○ ■

● try to pre-process graphical data ahead of time Multi-threaded Emulation

Lots of processing can be done in parallel

Multi-threaded emulation is often a necessity when targeting a multi-core platform.

PPU Tricks

Squeezing the PPU

Animated backgrounds Squeezing the PPU

Scanline scrolling backgrounds Squeezing the PPU

Animated palettes Squeezing the PPU

Spatial dithering

Debugging Debugging

● Virtual CPU debugger ○ ○ ■ ■ ○

● Virtual PPU/GPU debugger ○ ○

● General ○ ○ ○ ○ Going Further NES Components We Didn’t Cover

Processors —

Memory —

Audio —

Storage —

Input —

Output — Supporting Modern Systems

Processors —

Graphics —

Audio —

Network connectivity —

Convenience —

Storage —

Input —

Performance —

(and much more) Wrap Up Emulation is fun.

Emulation is often hard.

Emulation is about much more than games! Thanks for listening!

Grab the source to my NES emulator at:

https://github.com/ramenhut/simpleNES

Enjoyed the lecture? Check out more of my lectures at https://www.bertolami.com! Bonus: Top 5 NES Games by Sales

Sales Rank  Title                Units Sold
#1          Super Mario Bros.    40.2M
#2          Duck Hunt            28.3M
#3          Super Mario Bros. 3  18M
#4          Super Mario Bros. 2  7.4M
#5          Legend of Zelda      6.5M

The Zapper

Bonus: How Did the Zapper Work?

Let’s take a look at how this worked with Duck Hunt. Bonus: How Did the Zapper Work?

The player aims the zapper at the duck and pulls the trigger.

The player hears a gunshot sound and sees a flash.

Bonus: How Did the Zapper Work?

Frame 1: baseline
Frame 2: light up the targets
Frame 3: back to normal

light detected | no light detected | no light detected

Bonus: How Did the Zapper Work?

The system can render multiple hit targets (white boxes) at different points in time, which lets it tell which target the zapper was aimed at.

The zapper generally doesn't work on modern displays because of their high processing delay.
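The frame-by-frame detection described above might be sketched as follows. This is an illustration under stated assumptions rather than Duck Hunt's actual code: the baseline frame is assumed to be dark, each target is lit in its own frame, and the light sensor is modeled as a hypothetical callback. `aimed_sensor` simulates a player aiming at one target.

```c
#include <stdint.h>

/* Hypothetical sensor callback: returns nonzero if the zapper sees light
   while target `lit_target` is drawn white (-1 means no target is lit). */
typedef int (*light_sensor_fn)(int lit_target, void *ctx);

/* Returns the index of the target that was hit, or -1 for a miss. */
int zapper_detect_hit(int target_count, light_sensor_fn sensor, void *ctx) {
    /* Baseline frame: nothing is lit. Light here means the zapper is
       pointed at some other light source, so treat it as a miss. */
    if (sensor(-1, ctx)) return -1;
    /* Light one target per frame; the first frame with light is the hit. */
    for (int t = 0; t < target_count; t++) {
        if (sensor(t, ctx)) return t;
    }
    return -1;
}

/* Simulated sensor: the player aims at the target index stored in ctx. */
int aimed_sensor(int lit_target, void *ctx) {
    int aimed_at = *(const int *)ctx;
    return lit_target >= 0 && lit_target == aimed_at;
}
```

Each target needs its own frame, so the check only works if the display shows each frame at the moment the console expects, which ties into the processing delay problem on modern displays.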