Principles of
Software Emulation
Joe Bertolami www.bertolami.com
Part I: Emulation Basics
Part II: Emulator Architecture
Part III:

Part I: Emulation Basics

What is an emulator?
A software application written for a host system that mimics the behavior of a target system. This enables software originally written for the target system to be executed on the host.

Q: Why bother to write an emulator?

1. Preservation: Protect our ability to use legacy platforms long after their extinction.
2. Education: Learn about platform development through a fun technical challenge.
3. Software Piracy: Play your favorite games without having to buy them or the console. PIRACY HURTS KITTENS!
4. Commercial: Meet a business need, such as supporting backwards compatibility or pre-hardware development.

Non-gaming use cases:
● Development
● Testing
● Legacy hardware

Emulation != simulation

Emulation: Precise reconstruction of target hardware and software. Provides the ability to directly execute programs compiled for the target platform.

Simulation: Approximate reconstruction of target software (only). Generally produces similar behavior, but is compiled for the host platform.

Emulation Pattern
● Target software: compiled for the target platform
● Emulator: compiled for the host platform, but emulates the target hardware

Simulation Pattern
● Simulated software: compiled for the host platform

Emulation vs. Simulation

[Diagram: application source code is either compiled for the target platform, where it runs on the target platform or under emulation, or compiled for the host platform, where it runs under simulation on the host platform.]

[Figure: examples of emulators and simulators]

A simulator…
● is typically more efficient
● may be less complex

An emulator…
● is typically more precise
● may be the only practical way

Q: Why is emulation so difficult?
● Emulation of 1 target instruction usually requires more than 1 host instruction
● Translating instructions or data isn’t always straightforward
● Target software may exercise obscure, faulty, and undocumented parts of the hardware
● Some platforms don’t want to be emulated

Emulation in Practice
● Most emulators are approximations
● Precise emulation (e.g. cycle-level timing accuracy) requires significantly more horsepower
  (See: Why Perfect Hardware SNES Emulation Requires a 3 GHz CPU)

CPU Transistor Count

Target                  Host / Target transistor multiplier (host = first host to emulate the target ϕ)
Atari 2600              375x
Nintendo NES            68x
Nintendo Super NES      7.2x
Sony PlayStation        11.2x
Sega Dreamcast          8x
Microsoft Xbox          7.85x

ϕ Determined by date of first stable release that reasonably mimics original functionality.
The emulation scene took off in the 1990s as consumer grade hardware became capable of emulating early game systems, and the Internet provided:
● Access to a community of developers and enthusiasts
● Access to platform information and tools for reverse engineering
● A distribution channel for emulators and applications

Rule of thumb: ignoring outliers, it generally takes about 8-10x the horsepower in the host to emulate a target system.

The original Xbox emulator was built by Microsoft with the aid of full documentation. Even with this advantage, the emulator was never able to support the entire Xbox game catalog. The original Xbox (target) and Xbox 360 (host) were significantly different platforms in almost every way, and asymmetries in their CPU architectures made this an extremely challenging project.

Emulation in Practice
greatly slow emulator development
Example

Part II: Emulator Architecture

Building an Emulator
Research: Reverse engineer or obtain detailed specs that describe every part of the system. This likely includes the CPU, GPU, APU, DSPs, input, memory, storage, media, and network.
Build: Write the logic for each component, including their interconnections, boot processes, and interrupt handlers.
Test: Experiment with target software to find bugs and performance traps. Make adjustments as needed, and repeat the cycle (↻).

There are two broad levels of emulation: low level emulation and high level emulation.

Emulation Levels
Low level emulation
● Imitate a low level hardware interface, implementing virtual hardware
● Advantages:

High level emulation
● Intercept application calls to target hardware and route them to high level host APIs
● Advantages:

Emulation Levels
Low level emulation

Pseudo-snippet: Clear the screen using target platform API

    def render_frame():
        clear_screen(BLACK_COLOR)
        … some logic that renders the frame …
    ...
    def clear_screen(color):
        … some logic that controls hardware …

[Diagram: both routines compiled to 6502 machine code, e.g. ASL $43,X / ROL $F8,X / AND ($43,X) / JSR in render_frame, and AND $11F8,Y / EOR ($F8,X) / SED / ADC ($03,X) in clear_screen. A low level emulator executes each of these target instructions on its virtual hardware.]

Emulation Levels
High level emulation

Pseudo-snippet: Clear the screen using target platform API

    def render_frame():
        clear_screen(BLACK_COLOR)   # intercept and route to a native host routine
        … some logic that renders the frame …
    ...
    def clear_screen(color):
        … some logic that controls hardware …

[Diagram: render_frame still runs as 6502 code (ASL $43,X / ROL $F8,X / AND ($43,X) / JSR), but the call to clear_screen is intercepted and routed to a native host routine instead of executing the target’s clear_screen instructions.]

Emulation Architecture
Target hardware          Emulator component
CPU                      Virtual CPU
GPU                      Virtual display unit
APU                      Virtual audio unit
Memory                   Allocated memory buffers
Physical storage         Data files
Controllers              USB device managers
Network controllers      Socket manager

The virtual components talk to the outside world through host input APIs and host output APIs.

SO, LET’S BUILD AN EMULATOR

Nintendo Family Computer
● Released 10/15/1985
● Sold 62M units
  ○ $7B total revenue by 1992
● A Nintendo product, but also a partnership

Nintendo Famicom: System Overview
[Diagram: Famicom mainboard with the Controller Ports, Expansion Slot, PPU, Power Switch, CPU + APU, Lockout Chip, 2x2KB RAM, Cartridge Slot, and Display Output]

Minimum viable emulator (interactive images on screen): Controller Ports, PPU, CPU, and 2x2KB RAM.

[Diagram: the same board with the Lockout Chip highlighted]

Nintendo Famicom: Copy Protection

[Figure: lockout chip]

System Components We’ll Emulate
● CPU: Ricoh 6502 (inside a 2A03 package)
● PPU: Ricoh 2C02
● Input: 2 Controller Ports

Nintendo Famicom: Controller (NES)

Nintendo Famicom: Game Cartridge
● Memory cartridges
● Remarkably flaky

Did you have a special workaround to fix a bad connection? Nope. The only thing that helped was removing the cartridge and reinserting it (source).

Nintendo Famicom: Cartridge Overview
[Diagram: cartridge board with the Battery, Mapper / MMC, WRAM, Lockout Key (CIC), CHR ROM, and PRG ROM]

Minimum viable emulator (images on screen): CHR ROM and PRG ROM.

Cartridge Components We’ll Emulate
● Cartridge PRG ROM: game code, only accessible by the CPU
● Cartridge CHR ROM: tile graphics, only accessible by the PPU

Nintendo Famicom: Cartridge Overview
● Multiple variations exist
  ○ Metal Slader Glory
● Sophistication generally correlates with release year
● WRAM is used for save games

Metal Slader Glory is the largest officially licensed NES game ever created. It required a whopping 1 MB of storage, split between a 512 KB PRG ROM and a 512 KB CHR ROM.

System Coordination

Our emulator will mimic the architecture of the original system.

Our Emulator Design
● CPU
● Input
● PPU
● Game Cartridge

Memory Layout

Memory Overview
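● CPU has access to:
  ○ 2 KB of CPU RAM
  ○ Cartridge PRG ROM: typically 32 KB
  ○ Cartridge WRAM, PPU registers, and controller registers
● PPU has access to:
  ○ 2 KB of PPU RAM
  ○ Cartridge CHR ROM: typically 8 KB
  ○ Palette RAM and object attributes
● CPU and PPU both use 16-bit addresses with an 8 bit word size
● We can easily fit all system and cartridge data in host memory, so memory handling is extremely simple*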
*The NES does support bank switching, but we won’t need it for our purposes.

Memory Map — CPU

0x8000 — 0xFFFF   Cartridge PRG ROM (32 KB)
0x6000 — 0x7FFF   Cartridge WRAM (8 KB)
0x4000 — 0x5FFF   APU and Controller registers
0x2000 — 0x3FFF   PPU Registers (8 KB: 8 registers, mirrored)
0x0800 — 0x1FFF   Mirrors of CPU RAM (6 KB, three copies)
0x0000 — 0x07FF   CPU RAM (2 KB)

● 16-bit addresses, 8 bit word size
● 64K total address range, but only ~50 KB usable memory due to address mirroring
● We can easily fit all of this in RAM on a modern system!
● Game code: 32 KB Cartridge PRG ROM
● How the CPU talks to the PPU: 8 KB PPU Registers (8 mirrored registers)
● CPU working memory: 2 KB CPU RAM

SNIPPET 1 — READ CPU MEMORY

    uint8 system_bus::read_cpu_byte(uint16 address) {
        if (address >= 0x8000) {
            // Read from PRG ROM (on the cartridge).
            return game_cart->program_rom[address - 0x8000];
        } else if (address >= 0x6000) {
            // Read from WRAM (on the cartridge).
            return game_cart->save_ram[address - 0x6000];
        } else if (address == 0x4016 || address == 0x4017) {
            // Read controller state, described as an 8 bit value at one of
            // two addresses (registers).
            uint8 controller_idx = address - 0x4016;
            return keypads[controller_idx]->read();
        } else if (address >= 0x4000) {
            // Remaining APU and I/O registers are not emulated here; without
            // this branch they would fall through to the PPU registers.
            return 0;
        } else if (address >= 0x2000) {
            // Read PPU register. There are 8 of them, at 0x2000 to 0x2007,
            // repeated (mirrored) up to 0x3FFF.
            return ppu->read_ppu_register((address - 0x2000) & 0x7);
        } else {
            // Read from 2 KB CPU RAM. It is mirrored above 0x7FF, so we
            // always read from the lowest 2 KB.
            return system_ram[address & 0x7FF];
        }
    }

Memory Map — PPU
0x3F20 — 0x3FFF   Mirrors of Palette RAM
0x3F00 — 0x3F1F   Palette RAM (32 B)
0x2800 — 0x3EFF   Mirrors of PPU RAM (~6 KB)
0x2000 — 0x27FF   PPU RAM (2 KB)
0x0000 — 0x1FFF   CHR ROM (8 KB)
0x00 — 0xFF       Object Attributes (256 B, a separate address space)

● 16-bit addresses, 8 bit word size
● 16K total address range, but only ~10 KB usable memory due to address mirroring
● We can easily fit all of this in RAM on a modern system!
● Colors currently in use: 32 B Palette RAM
● Background framebuffer: 2 KB PPU RAM
● Background & sprite tiles: 8 KB CHR ROM
● Sprite data (position, flips, etc.): 256 B Object Attributes

We’ll come back to PPU memory in a bit and show how the CPU and PPU manage their data to present a layered and animated world. Don’t worry if this sounds a bit abstract for now!

SNIPPET 2 — READ PPU MEMORY

    uint8 system_bus::read_ppu_byte(uint16 address) {
        if (address >= 0x3F00) {
            // Read palette (32 B, mirrored up to 0x3FFF).
            return palette_ram[(address - 0x3F00) & 0x1F];
        } else if (address >= 0x2000) {
            // Read from VRAM (2 KB). Vertical and horizontal modes affect the
            // way that we organize data in this RAM. More on this in a bit.
            if (game_cart->header.mirror_mode) {
                // Vertical: 0x2000/0x2800 share the first 1 KB,
                // 0x2400/0x2C00 share the second.
                address &= 0x7FF;
            } else {
                // Horizontal: 0x2000/0x2400 share the first 1 KB,
                // 0x2800/0x2C00 share the second.
                address = ((address & 0x800) >> 1) | (address & 0x3FF);
            }
            return video_ram[address & 0x7FF];
        } else {
            // Read from CHR ROM (8 KB).
            return game_cart->tile_rom[address];
        }
    }
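Snippet 2 folds every nametable address into the 2 KB of VRAM, and that mirroring arithmetic is easy to get wrong. The sketch below isolates the mapping so it can be checked on its own. It assumes standard NES behavior (vertical mirroring pairs nametables 0/2 and 1/3, horizontal mirroring pairs 0/1 and 2/3); `vram_offset` and `Mirroring` are hypothetical names, not part of the original emulator, and standard `uint16_t` stands in for the snippet's `uint16`.

```cpp
#include <cstdint>

// Hypothetical enum: which nametables share each 1 KB half of VRAM.
enum class Mirroring { Horizontal, Vertical };

// Map a PPU address in 0x2000-0x3EFF to a byte offset in 2 KB of VRAM.
uint16_t vram_offset(uint16_t address, Mirroring mode) {
    // Position within the 4 KB nametable space; this also folds the
    // 0x3000-0x3EFF mirror back onto 0x2000-0x2EFF.
    uint16_t offset = address & 0x0FFF;
    if (mode == Mirroring::Vertical) {
        // Nametables 0/2 share the first 1 KB, nametables 1/3 the second.
        return offset & 0x07FF;
    }
    // Horizontal: nametables 0/1 share the first 1 KB, 2/3 the second.
    return static_cast<uint16_t>(((offset & 0x0800) >> 1) | (offset & 0x03FF));
}
```

With vertical mirroring, 0x2005 and 0x2805 land on the same VRAM byte, which is the arrangement the nametable scrolling slides later rely on.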
Whenever we detect the CPU or PPU attempting to read memory, we’ll call read_cpu_byte or read_ppu_byte instead. We’ll also need complementary write methods, as well as the ability to read and write 16 bit values with the CPU (implemented as two consecutive 8 bit memory operations).

Memory Interface
    uint8  read_cpu_byte(uint16 address)
    uint8  read_ppu_byte(uint16 address)
    void   write_cpu_byte(uint16 address, uint8 input)
    void   write_ppu_byte(uint16 address, uint8 input)
    uint16 read_cpu_short(uint16 address)
    void   write_cpu_short(uint16 address, uint16 input)

CPU Emulation

Emulating the CPU: Inside the 6502
● 8-bit processor with a 16-bit address bus
● 6 registers
● 53 opcodes

6502 Processor: This popular microprocessor design was used in multiple systems including the NES, Atari 2600, Apple IIe, and the Commodore 64. It was a relatively inexpensive yet versatile processor, and was considered to be more developer friendly than other available options (e.g. the Z80) at the time.

Program Flow

[Diagram: CPU memory, 8 bits wide, from CPU RAM (2 KB) at 0x0000 up to PRG ROM (32 KB) ending at 0xFFFF; PC points to 0xC000 in PRG ROM]

● Program Counter (PC)
● Stack Pointer (SP): holds 0xFD here; the 6502 stack lives in page 1 (0x0100 — 0x01FF), so this refers to address 0x01FD

Status Register (SR)

    N V B D I Z C

(bit 7 down to bit 0; the unused bit 5 sits between V and B)

N  set if an arithmetic operation produced a negative value
V  set if an arithmetic operation resulted in signed overflow (or underflow)
B  break interrupt received (may indicate a debug stop or reset)
D  decimal mode, not supported by the NES
I  set to disable maskable interrupts
Z  set if an operation produced a zero value
C  set if an operation produced a carry or borrow (includes shifts!)

FETCH → DECODE → EXECUTE → REPEAT

Opcode Lifecycle
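1. Fetch: read the opcode byte at the address held in PC.
2. Decode: determine the operation and compute the address of its operand.
3. Execute: perform the operation, updating registers, status flags, and memory.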
SNIPPET 3 — CPU CYCLE

    void virtual_cpu::cycle() {
        handle_interrupt();
        uint8 opcode = memory_bus->read_opcode(registers.pc);   /* fetch */
        uint16 operand_address = decode_opcode(opcode);         /* decode */
        // Advance PC past the whole instruction before executing it, as in
        // the ADC and JMP walkthroughs, so JMP, JSR, and branches can
        // overwrite PC without it being incremented afterwards.
        registers.pc += op_length_table[opcode];
        execute(opcode, operand_address);                       /* execute */
    }

ARITHMETIC & LOGIC (+, –, and, or, compare, inc, dec, rotate)
ADC, AND, ASL, CMP, CPX, CPY, DEC, DEX, DEY, EOR, INC, INX, INY, LSR, ORA, ROL, ROR, SBC, ACC_ASL, ACC_LSR, ACC_ROL, ACC_ROR
CONTROL FLOW (branch, conditional branch, return)
BCC, BCS, BEQ, BMI, BNE, BPL, BVC, BVS, JMP, JSR, RTI, RTS
TRANSFER (load from memory, store to memory, transfer between registers)
LDA, LDX, LDY, STA, STX, STY, TAX, TAY, TSX, TXA, TXS, TYA, PHA, PHP, PLA, PLP
INTERRUPT (halt execution and execute handler), STATUS
INT, BIT, BRK

(Hey, where are MUL and DIV?!)

Where’s the MUL? DIV?!
Developers were often able to avoid the need for MUL and DIV simply by restructuring their logic.
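As one illustration of such restructuring (our example, not one taken from the original slides), a multiply can be rewritten as shifts and adds, which the 6502 does support through ASL, ROL, and ADC. The sketch shows a constant multiply (x * 10 = x * 8 + x * 2) and a general 8 bit by 8 bit shift-and-add multiply; both function names are hypothetical.

```cpp
#include <cstdint>

// x * 10 rewritten as (x * 8) + (x * 2): two left shifts and an add,
// the kind of sequence a 6502 programmer could build from ASL/ROL and ADC.
uint16_t times_ten(uint8_t x) {
    uint16_t v = x;
    return static_cast<uint16_t>((v << 3) + (v << 1));
}

// General 8 bit x 8 bit -> 16 bit shift-and-add multiply: for each set bit
// of the multiplier, add the correspondingly shifted multiplicand.
uint16_t shift_add_multiply(uint8_t multiplicand, uint8_t multiplier) {
    uint16_t result = 0;
    uint16_t addend = multiplicand;
    while (multiplier != 0) {
        if (multiplier & 1) {
            result = static_cast<uint16_t>(result + addend);
        }
        addend = static_cast<uint16_t>(addend << 1);
        multiplier = static_cast<uint8_t>(multiplier >> 1);
    }
    return result;
}
```

The constant form is what usually appears in practice: when one factor is known ahead of time, the loop collapses into a fixed handful of shifts and adds.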
Let’s walk through an example: ADC

    ADC operand    // A = A + operand

All opcodes are 1 byte long, and most operands are also 1 byte long.

CPU Cycle for ADC Instruction
Memory (PC starts at 0xC000):

    0xC001   01000010   (0x42, the operand)
    0xC000   01101001   (0x69, the ADC opcode)

Initial state
    CPU pipeline state:  Opcode: 0x00   Operand 1: 0x00   Operand 2: 0x00
    Register state:      PC: 0xC000     SR: 0x04          A: 0x00

Step 1: read the opcode at PC, then advance PC.
    CPU pipeline state:  Opcode: 0x69   Operand 1: 0x00   Operand 2: 0x00
    Register state:      PC: 0xC001     SR: 0x04          A: 0x00

Step 2: read the operand at PC, then advance PC.
    CPU pipeline state:  Opcode: 0x69   Operand 1: 0x42   Operand 2: 0x00
    Register state:      PC: 0xC002     SR: 0x04          A: 0x00

Step 3: execute the instruction.
    Register state:  PC: 0xC002   SR: 0x04   A: 0x00

SNIPPET 3 — EXECUTE ADC OPCODE

    // Assumed definition; SAME_SIGN is used but not shown in the original:
    // true when bit 7 (the sign bit) of a and b match.
    #define SAME_SIGN(a, b) (!(((a) ^ (b)) & 0x80))

    void _execute_opcode_adc(uint8 operand) {
        // result = A + operand + C, computed in 16 bits to catch the carry.
        uint16 result = (uint16) registers.a + operand + status_reg.carry_bit;
        // Update SR to indicate interesting status about the operation.
        status_reg.carry_bit = !!(result & 0xFF00);
        status_reg.negative_bit = !!(result & 0x80);
        status_reg.zero_bit = !(result & 0xFF);
        status_reg.overflow_bit = SAME_SIGN(registers.a, operand) && !SAME_SIGN(operand, result);
        // Actually store the value in register A.
        registers.a = result & 0xFF;
    }
Register state after Step 3:  PC: 0xC002   SR: 0x04   A: 0x42
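A self-contained version of the same ADC logic is handy for unit tests. This is a sketch under stated assumptions rather than the original code: flags live in a small hypothetical `Flags` struct instead of `status_reg`, decimal mode is ignored (the NES does not support it), and the overflow test is written as a bit expression instead of the `SAME_SIGN` macro.

```cpp
#include <cstdint>

// Hypothetical flag holder for this sketch (the original keeps these bits
// in status_reg).
struct Flags {
    bool carry = false;
    bool zero = false;
    bool negative = false;
    bool overflow = false;
};

// Binary-mode ADC: returns A + operand + C and updates C, Z, N, and V.
// Decimal mode is ignored, as on the NES.
uint8_t adc(uint8_t a, uint8_t operand, Flags& flags) {
    uint16_t sum = static_cast<uint16_t>(a + operand + (flags.carry ? 1 : 0));
    uint8_t result = static_cast<uint8_t>(sum & 0xFF);
    flags.carry = sum > 0xFF;               // carry out of bit 7
    flags.zero = result == 0;
    flags.negative = (result & 0x80) != 0;  // bit 7 is the sign bit
    // Signed overflow: both inputs share a sign and the result's sign differs.
    flags.overflow = ((~(a ^ operand) & (a ^ result)) & 0x80) != 0;
    return result;
}
```

Calling `adc(0x00, 0x42, f)` with a cleared `Flags` reproduces the walkthrough: it returns 0x42 and sets none of C, Z, N, or V, which is why SR stays at 0x04 (only the I flag).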
CPU Addressing Modes
Memory Map — CPU (Revisited)
● 16-bit addresses (64K range), 8 bit word size (64 KB total size)

32 KB   Cartridge PRG ROM
16 KB   APU and controller registers, cartridge WRAM
8 KB    PPU Registers
2 KB    Mirrors of CPU RAM (three copies)
2 KB    CPU RAM: the first page of our memory map is the zero page

An address such as 0x4400 splits into a high byte (0x44, the page index, ranging from 0 to 255) and a low byte (0x00, the byte index in the page, ranging from 0 to 255).

6502 Memory Access
OPERAND ADDRESSING MODES
Operand addressing modes specify a variety of different ways that we can store and compute a memory address for use in fetching an operand value.
This is one of the most confusing parts about 6502 emulation!
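(I apologize for the next slide!)

Here @(a) denotes the value stored at address a.

Mode type              Operand syntax      Operation
Absolute               ADC @(0x4400)       operand at 0x4400
Absolute, X            ADC @(0x4400+X)     operand at 0x4400+X
Absolute, Y            ADC @(0x4400+Y)     operand at 0x4400+Y
Immediate              ADC 0x42            operand is 0x42
Zero page              ADC @(0x44)         operand at 0x0044
Zero page, X           ADC @(0x44+X)       operand at 0x0044+X
Zero page, Y           ADC @(0x44+Y)       operand at 0x0044+Y
Indirect               ADC @@(0x4400)      operand at the address stored at 0x4400
Indexed indirect       ADC @@(0x44+X)      operand at the address stored at 0x0044+X
Indirect indexed       ADC @(@(0x44)+Y)    operand at (the address stored at 0x0044) + Y

ADC is used in every row for uniformity; on real hardware the plain Indirect form is only used by JMP, and Zero page, Y only by LDX and STX.

Addressing Modes — Absolute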
● ADC @(0x4400): operand is the value located at address 0x4400
● ADC @(0x4400+X): operand is the value located at address 0x4400+X
● ADC @(0x4400+Y): operand is the value located at address 0x4400+Y

Addressing Modes — Immediate
● ADC 0x42: operand is the value 0x42

Addressing Modes — Zero Page
● ADC @(0x44): operand is the value located at address 0x0044 (in the zero page)
● ADC @(0x44+X): operand is the value located at address 0x0044+X (in the zero page)
● ADC @(0x44+Y): operand is the value located at address 0x0044+Y (in the zero page)

Addressing Modes — Indirect
● ADC @@(0x4400): fetch an address from 0x4400, use it to fetch the operand
● ADC @@(0x44+X): fetch an address from 0x0044+X, use it to fetch the operand
● ADC @(@(0x44)+Y): fetch an address from 0x0044, add Y, use the result to fetch the operand

JMP Instruction
● JMP comes in an absolute addressing version and an indirect addressing version

CPU Cycle for JMP (Absolute Addressing)
    JMP 0x4400    // set PC to 0x4400

Memory (PC starts at 0xC000):

    0xC002   01000100   (0x44, high byte of the target)
    0xC001   00000000   (0x00, low byte of the target)
    0xC000   01001100   (0x4C, the JMP opcode)

Initial state
    Register state:  PC: 0xC000
    Pipeline state:  Opcode: 0x00   Cache: 0x0000

Step 1: read the opcode at PC, then advance PC.
    Register state:  PC: 0xC001
    Pipeline state:  Opcode: 0x4C   Cache: 0x0000

Step 2: read the two address bytes (low byte first), then advance PC past them.
    Register state:  PC: 0xC003
    Pipeline state:  Opcode: 0x4C   Cache: 0x4400
Step 3: execute the instruction.
    Register state:  PC: 0xC003
    Pipeline state:  Cache: 0x4400

SNIPPET 4 — EXECUTE JMP OPCODE

    void _execute_opcode_jmp(uint16 cached_address) {
        // Set PC to our target address. The next CPU cycle will begin
        // execution at this address.
        registers.pc = cached_address;
    }
Register state after Step 3:  PC: 0x4400   (Cache: 0x4400)
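To tie fetch, decode, and execute together, here is a toy interpreter that supports only four opcodes: LDA immediate (0xA9), INX (0xE8), JMP absolute (0x4C), and BRK (0x00, treated here simply as a halt). It is a hypothetical sketch, not the emulator from these slides: memory is a flat 64 KB array with no mirroring or cycle counting, and it reads both possible operand bytes up front, which a real emulator should avoid because reads of memory-mapped registers can have side effects. As in the walkthrough, PC is advanced past the whole instruction before execution, so JMP simply overwrites it.

```cpp
#include <array>
#include <cstdint>

// Toy fetch-decode-execute loop supporting four opcodes (hypothetical).
struct ToyCpu {
    std::array<uint8_t, 0x10000> memory{};
    uint16_t pc = 0;
    uint8_t a = 0;
    uint8_t x = 0;
    bool halted = false;

    uint8_t read(uint16_t address) const { return memory[address]; }

    // Instruction length in bytes (opcode plus operand bytes).
    static uint16_t length(uint8_t opcode) {
        switch (opcode) {
        case 0xA9: return 2;  // LDA #imm
        case 0x4C: return 3;  // JMP abs
        default: return 1;    // INX, BRK
        }
    }

    void step() {
        uint8_t opcode = read(pc);                         // fetch
        // Decode: gather operand bytes (read eagerly for brevity).
        uint8_t lo = read(static_cast<uint16_t>(pc + 1));
        uint8_t hi = read(static_cast<uint16_t>(pc + 2));
        pc = static_cast<uint16_t>(pc + length(opcode));   // advance first
        switch (opcode) {                                  // execute
        case 0xA9: a = lo; break;                          // LDA #imm
        case 0xE8: x = static_cast<uint8_t>(x + 1); break; // INX
        case 0x4C: pc = static_cast<uint16_t>(lo | (hi << 8)); break;  // JMP abs
        default: halted = true; break;                     // BRK/unknown: stop
        }
    }

    void run(int max_steps) {
        for (int i = 0; i < max_steps && !halted; ++i) {
            step();
        }
    }
};
```

Placing the bytes 4C 00 44 at 0xC000 and calling `step()` once leaves PC at 0x4400, matching the JMP walkthrough.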
CPU Cycle for JMP (Indirect Addressing)
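    JMP @(0x4400)    // set PC to the address stored at 0x4400

[Diagram: JMP reads a 16-bit address from 0x4400 (low byte) and 0x4401 (high byte), then jumps to it]

Hazard: 6502 Indirect Address Bug

When the indirect pointer sits on the last byte of a page (e.g. 0x00FF), the 6502 fetches the high byte from the start of that same page (0x0000) instead of from the next page (0x0100). Software can depend on this, so you may need to replicate buggy platform behavior.

Example: JMP @(0x00FF)

    Page 1   0x0100   00000000   (0x00)
    Page 0   0x00FF   01001010   (0x4A)
    Page 0   0x0000   00100010   (0x22)

We would expect our loaded address to be 0x004A: the low byte fetch reads 0x00FF and the high byte fetch reads 0x0100.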
Instead, the high byte fetch wraps around within page 0:

    Page 0   0x00FF   01001010   (0x4A)   low byte fetch
    Page 0   0x0000   00100010   (0x22)   erroneous high byte fetch

Thus we expected a value of 0x004A, but received a value of 0x224A.
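The addressing modes and this hazard can be expressed as a handful of effective-address helpers. The sketch below is illustrative: the helper names and the flat 64 KB `Memory` array are assumptions, and only some of the modes are shown. It includes zero page wraparound and a `read16_jmp_indirect_bug` helper that reproduces the page-wrap behavior, so JMP @(0x00FF) yields 0x224A with the memory contents shown above.

```cpp
#include <array>
#include <cstdint>

// Hypothetical flat 64 KB memory for this sketch (no mirroring).
using Memory = std::array<uint8_t, 0x10000>;

// Little endian 16-bit read from two consecutive addresses.
uint16_t read16(const Memory& m, uint16_t address) {
    uint16_t next = static_cast<uint16_t>(address + 1);
    return static_cast<uint16_t>(m[address] | (m[next] << 8));
}

// Zero page pointer read: the high byte address wraps within page 0.
uint16_t read16_zero_page(const Memory& m, uint8_t zp) {
    uint8_t next = static_cast<uint8_t>(zp + 1);
    return static_cast<uint16_t>(m[zp] | (m[next] << 8));
}

// JMP @(address) as the 6502 performs it: the high byte is read from the
// same page as the low byte, so a pointer at 0x..FF wraps to 0x..00.
uint16_t read16_jmp_indirect_bug(const Memory& m, uint16_t address) {
    uint16_t hi_address =
        static_cast<uint16_t>((address & 0xFF00) | ((address + 1) & 0x00FF));
    return static_cast<uint16_t>(m[address] | (m[hi_address] << 8));
}

// Effective addresses for several of the modes listed earlier.
uint16_t absolute_indexed(uint16_t base, uint8_t index) {
    return static_cast<uint16_t>(base + index);  // @(base+X) or @(base+Y)
}
uint16_t zero_page_indexed(uint8_t base, uint8_t index) {
    return static_cast<uint8_t>(base + index);   // stays inside page 0
}
uint16_t indexed_indirect(const Memory& m, uint8_t base, uint8_t x) {
    return read16_zero_page(m, static_cast<uint8_t>(base + x));  // @@(base+X)
}
uint16_t indirect_indexed(const Memory& m, uint8_t base, uint8_t y) {
    return static_cast<uint16_t>(read16_zero_page(m, base) + y);  // @(@(base)+Y)
}
```

An emulator would use `read16` for ordinary absolute operands and reserve `read16_jmp_indirect_bug` for the indirect JMP opcode only.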
Little Endian vs. Big Endian
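Another important detail: the 6502 stores 16-bit values with the low byte at the lower address. In the JMP 0x4400 example, the target was stored as 0x00 (at 0xC001) followed by 0x44 (at 0xC002). This is known as little endian order. Storing the high byte first is known as big endian order.

Little endian:          X86, X64 (i.e. Intel, AMD, etc.), 6502
Big endian:             PowerPC, 68000, several game consoles
Both (configurable):    ARM, MIPS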
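Returning to the memory interface, the 16-bit CPU operations can be built from two consecutive 8-bit operations, low byte first. In this sketch `CpuRam` is a hypothetical stand-in that models only the 2 KB of CPU RAM (mirrored up to 0x1FFF); its method names are borrowed from the memory interface listed earlier, and standard `uint8_t`/`uint16_t` replace `uint8`/`uint16`.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the CPU side of the memory interface: only
// the 2 KB of CPU RAM (mirrored up to 0x1FFF) is modelled.
class CpuRam {
public:
    uint8_t read_cpu_byte(uint16_t address) const { return ram_[address & 0x07FF]; }
    void write_cpu_byte(uint16_t address, uint8_t input) { ram_[address & 0x07FF] = input; }

    // 16-bit accesses are two consecutive 8-bit accesses, low byte first
    // (little endian), matching the 6502.
    uint16_t read_cpu_short(uint16_t address) const {
        uint8_t lo = read_cpu_byte(address);
        uint8_t hi = read_cpu_byte(static_cast<uint16_t>(address + 1));
        return static_cast<uint16_t>(lo | (hi << 8));
    }
    void write_cpu_short(uint16_t address, uint16_t input) {
        write_cpu_byte(address, static_cast<uint8_t>(input & 0xFF));
        write_cpu_byte(static_cast<uint16_t>(address + 1), static_cast<uint8_t>(input >> 8));
    }

private:
    std::array<uint8_t, 0x800> ram_{};
};
```

Writing 0x4400 stores 0x00 at the lower address and 0x44 at the next one, the same byte order as the operand of JMP 0x4400; thanks to mirroring, the value can also be read back through 0x0810.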
(Homework: think of the pros and cons of each ordering method.)

Achievement Unlocked: CPU Emulation
● interrupts
● cycle count

PPU Emulation

Emulating the PPU: Inside the 2C02
● Innovative memory architecture
● Supported detailed graphics through heavy reuse of low fidelity graphics data

Memory Efficiency
[Figure: 0.5 KB]

Principles of PPU Emulation
Five key concepts

Background vs. Foreground
[Figure: background + foreground = final frame]
● Background: where, how, colors
● Foreground

PPU Operation
The PPU’s frame buffer is known as a nametable. Our frame is all black by default.

PPU Memory Map
0x3F20 — 0x3FFF   Mirrors of Palette RAM
0x3F00 — 0x3F1F   Palette RAM (32 B)
0x3000 — 0x3EFF   Mirrors of PPU RAM
0x2000 — 0x2FFF   PPU RAM (4 KB address range, backed by 2 KB and mirrored)
    0x2C00 — 0x2FFF   Nametable 3 (1 KB)
    0x2800 — 0x2BFF   Nametable 2 (1 KB)
    0x2400 — 0x27FF   Nametable 1 (1 KB)
    0x2000 — 0x23FF   Nametable 0 (1 KB)
0x0000 — 0x1FFF   CHR ROM (8 KB)
0x00 — 0xFF       Object Attributes (256 B)

Where is our frame buffer stored? Here, in nametable 0: the low 1 KB of our PPU RAM. We’ll talk about the additional nametables later, when we discuss scrolling.
Our frame is 256x240 pixels. Each 8x8 block of pixels is represented by one tile reference in our frame buffer, so the frame buffer holds references to tiles rather than pixels: 32x30 tiles, or 960 tile references, for 960 bytes total.
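Finding the tile reference for a given pixel is simple integer arithmetic: 256 / 8 = 32 tile columns and 240 / 8 = 30 tile rows. A minimal sketch for nametable 0 (the function name is hypothetical):

```cpp
#include <cstdint>

// Address of the tile reference covering screen pixel (x, y) in
// nametable 0: 32 references per row, one per 8x8 tile.
uint16_t tile_reference_address(unsigned x, unsigned y) {
    const unsigned nametable0 = 0x2000;
    unsigned tile_column = x / 8;  // 0..31 for x in 0..255
    unsigned tile_row = y / 8;     // 0..29 for y in 0..239
    return static_cast<uint16_t>(nametable0 + tile_row * 32 + tile_column);
}
```

The last reference, for pixel (255, 239), sits at 0x2000 + 959 = 0x23BF, so all 960 references fit within the first 1 KB nametable; the pixel inside that tile is simply (x % 8, y % 8).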
Tile Based Rendering
The frame is composed of tiles, which are reusable image patterns. CHR ROM holds two kinds:
● Background tile patterns
● Foreground sprite patterns

PPU Memory Map
● Each 8x8 tile requires 16 bytes of memory (64 pixels at 2 bits each)
● CHR ROM (8 KB, 0x0000 — 0x1FFF): games typically supported up to 256 different background tile patterns

Anatomy of a CHR tile
[Figure: the 256x240 output image is assembled from 8x8 pixel tiles]

A frame buffer tile reference (here 00010000, i.e. 0x10) selects the referenced CHR tile from CHR ROM:

    00 00 00 00 00 00 00 11
    00 00 00 00 00 00 11 10
    00 00 00 00 00 11 10 10
    00 00 00 00 11 10 10 10
    00 00 00 11 10 10 10 10
    00 00 11 10 10 10 10 10
    00 11 10 10 10 10 10 10
    11 10 10 10 10 10 10 10

Each pixel in the tile does not directly hold a color value; it holds a 2-bit index (00, 01, 10, or 11) into a color palette.
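The 16 bytes of a tile are not stored as 2-bit pairs side by side. On the NES (a detail these slides do not spell out), a tile is two 8-byte bit planes: bytes 0 to 7 hold the low bit of each pixel, one byte per row, and bytes 8 to 15 hold the high bit, with bit 7 of each byte being the leftmost pixel. A decoding sketch with hypothetical names:

```cpp
#include <array>
#include <cstdint>

// A decoded 8x8 tile: each entry is a 2-bit palette index (0 to 3).
using TilePixels = std::array<std::array<uint8_t, 8>, 8>;

// Decode one 16-byte CHR tile stored as two bit planes: bytes 0-7 hold
// the low bit of each pixel (one byte per row), bytes 8-15 the high bit.
// Bit 7 of each byte is the leftmost pixel of that row.
TilePixels decode_chr_tile(const std::array<uint8_t, 16>& tile) {
    TilePixels pixels{};
    for (int row = 0; row < 8; ++row) {
        uint8_t low_plane = tile[row];
        uint8_t high_plane = tile[row + 8];
        for (int col = 0; col < 8; ++col) {
            int bit = 7 - col;
            uint8_t low = (low_plane >> bit) & 1;
            uint8_t high = (high_plane >> bit) & 1;
            pixels[row][col] = static_cast<uint8_t>((high << 1) | low);
        }
    }
    return pixels;
}
```

For the tile shown above, the low plane is 01 02 04 08 10 20 40 80 and the high plane is 01 03 07 0F 1F 3F 7F FF (hex), which decodes back to the diagonal 11/10 pattern.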
64 colors in total
Notice that this color space wastes 10 values for black, lacks a true yellow, and has a fairly poor grayscale (making transitions and fades difficult).
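A sketch of how a 2-bit pixel index becomes one of those 64 colors. It assumes the standard NES palette RAM layout, which these slides only partly describe: 32 bytes holding four background palettes (0x3F00 to 0x3F0F) and four sprite palettes (0x3F10 to 0x3F1F), four entries each, where index 0 falls back to the common background color at 0x3F00. `system_color` is a hypothetical helper.

```cpp
#include <array>
#include <cstdint>

// Palette RAM (0x3F00-0x3F1F): entries are 6-bit indices into the 64-color
// system palette. 0x00-0x0F: four background palettes; 0x10-0x1F: four
// sprite palettes; four entries each.
using PaletteRam = std::array<uint8_t, 32>;

// Resolve a 2-bit pixel index from a CHR tile to a system color index.
// Pixel 0 uses the common background color at entry 0 (for sprites, pixel 0
// is transparent, so a renderer would normally skip it instead).
uint8_t system_color(const PaletteRam& palette_ram, bool is_sprite,
                     uint8_t palette, uint8_t pixel) {
    if ((pixel & 0x03) == 0) {
        return palette_ram[0] & 0x3F;
    }
    unsigned index = (is_sprite ? 0x10u : 0x00u) + (palette & 0x03u) * 4u + (pixel & 0x03u);
    return palette_ram[index] & 0x3F;
}
```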
[Figure: one 4-color palette, indexed 00, 01, 10, 11]

Further palette limitations:
● Color 00 of every background palette is one common background color
● Color 00 of every sprite palette is transparent

PPU Palette Storage
0x3F20 — 0x3FFF   Mirrors of Palette RAM
0x3F00 — 0x3F1F   Palette RAM (32 B)
0x3000 — 0x3EFF   Mirrors of PPU RAM
0x2000 — 0x2FFF   PPU RAM (4 KB, 2 KB mirrored)
0x0000 — 0x1FFF   CHR ROM (8 KB)
0x00 — 0xFF       Object Attributes (256 B)

PPU Palette Restrictions
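One final palette limitation: palettes are assigned per meta-tile (a 2x2 group of tiles, 16x16 pixels), not per individual tile.

[Figure: meta-tile boundary vs. tile boundary]

Tile Based Rendering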
● 30 tiles used in the frame, requiring 480 bytes in the CHR ROM
  ○ 22 background tiles (352 bytes)
  ○ 8 sprite tiles (128 bytes)
● Only 22 unique tiles are used in the frame, requiring 352 bytes in the CHR ROM
  ○ 15 background tiles (240 bytes), down from 22
  ○ 7 sprite tiles (112 bytes), down from 8
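Because each unique pattern is stored once, the CHR ROM cost of a frame follows directly from the number of distinct 16-byte patterns. A small sketch (hypothetical helper name) that reproduces this arithmetic:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using ChrTile = std::array<uint8_t, 16>;  // one 8x8 tile pattern

// CHR ROM bytes needed when every distinct pattern is stored exactly once.
std::size_t chr_bytes_needed(const std::vector<ChrTile>& tiles_in_frame) {
    std::set<ChrTile> unique_patterns(tiles_in_frame.begin(), tiles_in_frame.end());
    return unique_patterns.size() * 16;  // 16 bytes per 8x8 tile
}
```

Twenty-two unique patterns therefore cost 22 x 16 = 352 bytes, no matter how many times each one appears on screen.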
Anatomy of a Sprite
Sprites are actually fairly similar to background tiles, but with a few exceptions:
● At most 8 sprites can be drawn on a single scanline
● Each sprite carries its own attributes

[Figure: a character built from four separate 8x8 pixel sprites]
Sprites are 4 bytes each, so we can store up to 64 concurrent sprites in our 256 byte object attribute memory (0x00 — 0xFF).
    typedef struct sprite_desc {
        uint8 sprite_y;        // sprite y pixel coordinate
        uint8 tile_reference;  // indicates tile pattern to use for rendering
        uint8 attributes;      // attributes (more on this next)
        uint8 sprite_x;        // sprite x pixel coordinate
    } sprite_desc;
Attribute byte layout (bit 7 to bit 0):

    V H M U U U P P

P  palette select, uses two bits to select one of four 4-color sprite palettes
U  unused: never read, hopefully never written
M  background mask, indicates if the sprite is behind the background
H  flip sprite horizontally
V  flip sprite vertically
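Decoding the attribute byte is a matter of masking bits. The sketch assumes the bit order shown (V in bit 7 down to P in bits 1 and 0); `SpriteAttributes`, `decode_attributes`, and `tile_column` are hypothetical names.

```cpp
#include <cstdint>

// Decoded sprite attribute byte, laid out as V H M U U U P P (bit 7 first).
struct SpriteAttributes {
    bool flip_vertical;      // V, bit 7
    bool flip_horizontal;    // H, bit 6
    bool behind_background;  // M, bit 5
    uint8_t palette;         // P, bits 1-0: one of four sprite palettes
};

SpriteAttributes decode_attributes(uint8_t attributes) {
    SpriteAttributes a{};
    a.flip_vertical = (attributes & 0x80) != 0;
    a.flip_horizontal = (attributes & 0x40) != 0;
    a.behind_background = (attributes & 0x20) != 0;
    a.palette = static_cast<uint8_t>(attributes & 0x03);  // bits 4-2 (U) ignored
    return a;
}

// Column of the 8x8 tile to sample for on-screen column x (0 to 7) of the
// sprite, honouring the horizontal flip bit.
int tile_column(const SpriteAttributes& a, int x) {
    return a.flip_horizontal ? 7 - x : x;
}
```

The flip bits are what make the tile reuse below possible: a mirrored half of a character can be drawn from the same CHR pattern by sampling columns in reverse.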
[Figure: tile reuse]

[Figure: even more tile reuse]

PPU Sprite Layering
For each scanline, the PPU copies the relevant sprites from the primary object attribute memory (OAM) into a smaller secondary OAM:

    for each scanline:
        clear secondary OAM
        scan the sprites in the primary OAM (0 -> 63):
            if a sprite intersects the current scanline, write it into the secondary OAM
            (the secondary OAM holds only 8 sprites; further matches are dropped)
        scan the sprites in secondary OAM in reverse order (# sprites - 1 -> 0):
            render the sprite into the current scanline

Megaman II (and others) “worked around” the 8 sprite per scanline limitation by reordering sprites each frame. This resulted in noticeable sprite flicker, but enabled some incredibly dynamic gameplay.

Background Scrolling
The NES supported smooth scrolling backgrounds, and this was a significant competitive advantage for the NES.

Smooth Scrolling on a PC

The first widely known demo of a smooth scrolling background on the PC was Dangerous Dave in Copyright Infringement (1990), which recreated a classic level from Super Mario Bros. 3. It was created by John Carmack and Tom Hall, who would ultimately found the id Software game studio (with credits including the Commander Keen, DOOM, and Quake series).

Background Scrolling
(Figure: no scrolling, horizontal scrolling, vertical scrolling, horizontal + vertical scrolling)
No scrolling: all of the action happens on a single screen.
Horizontal scrolling: background pans horizontally to offer exploration of a larger world.
Vertical scrolling: background pans vertically to offer exploration of a larger world.
Full scrolling: background pans vertically and horizontally to offer exploration of a larger world.
Nametable scrolling
(Figure: the four nametables at 0x2000 - 0x23FF, 0x2400 - 0x27FF, 0x2800 - 0x2BFF, and 0x2C00 - 0x2FFF)
(Figure: 0x2000 - 0x23FF and 0x2400 - 0x27FF are backed by our 2 KB RAM; 0x2800 - 0x2BFF and 0x2C00 - 0x2FFF are mirrors of them)
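The mirroring in this figure can be emulated by folding every nametable address onto the 2 KB of physical RAM. Below is a minimal sketch, assuming the arrangement shown: 0x2800 mirrors 0x2000 and 0x2C00 mirrors 0x2400. The name `nametable_offset` is hypothetical.

```c
#include <stdint.h>

#define VRAM_SIZE 0x800   /* 2 KB of physical nametable RAM */

/* Map a PPU nametable address (0x2000-0x3EFF) to an offset in 2 KB VRAM.
 * Assumes 0x2000 and 0x2400 are backed by RAM, while 0x2800 and 0x2C00
 * mirror them. */
uint16_t nametable_offset(uint16_t addr)
{
    /* fold 0x3000-0x3EFF onto 0x2000-0x2EFF, then rebase to 0 */
    uint16_t a = (uint16_t)((addr - 0x2000) & 0x0FFF);
    /* drop bit 11 so that 0x2800 lands on 0x2000 and 0x2C00 on 0x2400 */
    return (uint16_t)(a & 0x07FF);
}
```

Other cartridges wire the mirrors the other way, with 0x2400 mirroring 0x2000. In that case the offset after the same fold would be `(a & 0x3FF) | ((a & 0x800) >> 1)`.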
(Figure: nametable 0 at 0x2000, nametable 1 at 0x2400, and the mirror of nametable 0 at 0x2800; a SCREEN-sized window slides to the right as Scroll X advances)
Scroll X: 0x2000: the screen shows nametable 0
Scroll X: 0x2100 and 0x2300: tiles that are no longer visible are updated with new tiles (and the change is reflected in our mirror)
Scroll X: 0x2400: nametable 0 is now fully reset, ready to present a fresh background
Scroll X: 0x2500 and 0x2700: leveraging address mirroring for smooth nametable traversal
Scroll X: 0x2800: nametable 1 is fully updated with a fresh background
Scroll X: 0x2000: reset Scroll X to 0x2000 once we complete a scroll cycle (0x2800 mirrors 0x2000, so the screen contents are identical and the jump is seamless)
Concept Review
Five key concepts
Input
Basic Controller Support
● Controller registers: 0x4016 (controller 1) and 0x4017 (controller 2)
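Reading a controller is a serial protocol. Writing 1 and then 0 to bit 0 of 0x4016 latches the button states of both controllers. Each subsequent read of 0x4016 (controller 1) or 0x4017 (controller 2) returns one button in bit 0, in the order A, B, Select, Start, Up, Down, Left, Right. Below is a minimal sketch of one controller port. It assumes the host fills in `buttons` from keyboard or gamepad input. The struct and function names are hypothetical. Real reads also return open-bus values in the upper bits, which this sketch ignores.

```c
#include <stdint.h>

/* Button bits: bit 0 = A ... bit 7 = Right. */
enum {
    BTN_A = 1 << 0, BTN_B = 1 << 1, BTN_SELECT = 1 << 2, BTN_START = 1 << 3,
    BTN_UP = 1 << 4, BTN_DOWN = 1 << 5, BTN_LEFT = 1 << 6, BTN_RIGHT = 1 << 7
};

typedef struct controller {
    uint8_t buttons;  /* live host input, one bit per button       */
    uint8_t shift;    /* latched copy, shifted out one bit per read */
    int     strobe;   /* last value written to bit 0 of 0x4016      */
} controller;

/* CPU write to 0x4016. While strobe is high (and on the 1 -> 0 edge),
 * the shift register keeps reloading from the live button state. */
void controller_write(controller *c, uint8_t value)
{
    if (c->strobe || (value & 1))
        c->shift = c->buttons;
    c->strobe = value & 1;
}

/* CPU read from 0x4016 or 0x4017: returns the next button in bit 0. */
uint8_t controller_read(controller *c)
{
    if (c->strobe)
        return c->buttons & 1;   /* strobe high: always report A */
    uint8_t bit = c->shift & 1;
    /* shift in 1s so reads after the 8th return 1, like official pads */
    c->shift = (uint8_t)((c->shift >> 1) | 0x80);
    return bit;
}
```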
Button state byte (bit 7 to bit 0):
7 Right, 6 Left, 5 Down, 4 Up, 3 Start, 2 Select, 1 B, 0 A
Part III
Advanced Topics
Security
Why Security Matters (to the platform manufacturer)
Integrity of the platform
Integrity of the business
Terminology
Security Model Goals
Prevent users from running arbitrary code
Prevent users from running duplicated code
Platform Security 101
Strategies:
● Platform verification
● Application verification
Common tactics:
These are particularly challenging for emulation.
Platform manufacturers know that their security model will be breached.
Optimizations
Just-in-time Compilation
With this design, each 6502 opcode will require many host opcodes.
Instead, convert 6502 opcodes into host opcodes and feed them directly into the host processor.
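A true JIT emits native host machine code into executable memory, which is too platform-specific for a short example. The sketch below is a portable stand-in that shows the same translate-once, run-many idea. A block of 6502 code is decoded once into a cached array of host function pointers (often called pre-decoding or threaded interpretation), and later runs skip fetch and decode entirely. Several simplifications apply. Only LDA #imm (0xA9), TAX (0xAA), and INX (0xE8) are handled, and any other opcode ends the block. Cache invalidation for self-modifying code is ignored. All names are hypothetical.

```c
#include <stdint.h>

/* Minimal 6502 state: only what these opcodes touch. */
typedef struct cpu {
    uint8_t  a, x, p;           /* accumulator, X register, status flags */
    uint16_t pc;
    uint8_t  mem[0x10000];
} cpu;

enum { FLAG_Z = 0x02, FLAG_N = 0x80 };

static void set_zn(cpu *c, uint8_t v)
{
    c->p = (uint8_t)((c->p & ~(FLAG_Z | FLAG_N)) |
                     (v == 0 ? FLAG_Z : 0) | (v & FLAG_N));
}

/* One translated instruction: a host function plus its operand. */
typedef struct op {
    void (*fn)(cpu *c, uint8_t operand);
    uint8_t operand;
} op;

static void op_lda_imm(cpu *c, uint8_t v) { c->a = v; set_zn(c, c->a); }
static void op_tax(cpu *c, uint8_t v)     { (void)v; c->x = c->a; set_zn(c, c->x); }
static void op_inx(cpu *c, uint8_t v)     { (void)v; c->x++; set_zn(c, c->x); }

#define MAX_BLOCK 64

/* Translate a straight-line block starting at 'pc' once. Returns the number
 * of ops and stores the address just past the block in *end_pc. */
int translate_block(const cpu *c, uint16_t pc, op out[MAX_BLOCK], uint16_t *end_pc)
{
    int n = 0;
    while (n < MAX_BLOCK) {
        uint8_t opcode = c->mem[pc];
        if (opcode == 0xA9) {
            out[n++] = (op){ op_lda_imm, c->mem[(uint16_t)(pc + 1)] };
            pc += 2;
        } else if (opcode == 0xAA) {
            out[n++] = (op){ op_tax, 0 };
            pc += 1;
        } else if (opcode == 0xE8) {
            out[n++] = (op){ op_inx, 0 };
            pc += 1;
        } else {
            break;   /* BRK or anything unsupported ends the block */
        }
    }
    *end_pc = pc;
    return n;
}

/* Run a cached block: no fetch or decode work remains. */
void run_block(cpu *c, const op *ops, int n, uint16_t end_pc)
{
    for (int i = 0; i < n; i++)
        ops[i].fn(c, ops[i].operand);
    c->pc = end_pc;
}
```

Running the same cached block twice produces the same result without re-decoding. The translation cost is paid once and then amortized.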
● Just in time
● Ahead of time
Caveat:
Graphics Hardware Acceleration
● Try to pre-process graphical data ahead of time
Multi-threaded Emulation
Lots of processing can be done in parallel
Multi-threaded emulation is often a necessity when the target is a multi-core platform.
PPU Tricks
Squeezing the PPU
● Animated backgrounds
● Scanline scrolling backgrounds
● Animated palettes
● Spatial dithering
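Animated palettes are cheap because a game can change many on-screen pixels by rewriting only a few bytes of palette RAM, without touching any tiles. Below is a minimal sketch of color cycling on one 4-entry palette. It assumes palette RAM (0x3F00 - 0x3F1F) is modeled as a 32-byte array. The name `cycle_palette` is hypothetical.

```c
#include <stdint.h>

#define PALETTE_RAM_SIZE 32   /* 0x3F00 - 0x3F1F */

/* Rotate entries 1-3 of 4-color palette 'pal' (0-7) by one position.
 * Entry 0 is skipped because it acts as the backdrop (background)
 * or transparent (sprite) color. */
void cycle_palette(uint8_t palette_ram[PALETTE_RAM_SIZE], int pal)
{
    uint8_t *p = &palette_ram[pal * 4];
    uint8_t last = p[3];
    p[3] = p[2];
    p[2] = p[1];
    p[1] = last;
}
```

Calling this every few frames (for example, when `frame % 8 == 0`) makes water or lava appear to flow. For the emulator, the requirement is simply that palette writes take effect on the next rendered pixels.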
Debugging
● Virtual CPU debugger
● Virtual PPU/GPU debugger
● General
Going Further
NES Components We Didn’t Cover
● Processors
● Memory
● Audio
● Storage
● Input
● Output
Supporting Modern Systems
● Processors
● Graphics
● Audio
● Network connectivity
● Convenience
● Storage
● Input
● Performance
(and much more)
Wrap Up
Emulation is fun.
Emulation is often hard.
Emulation is about much more than games!
Thanks for listening!
Grab the source to my NES emulator at:
https://github.com/ramenhut/simpleNES
Enjoyed the lecture? Check out more of my lectures at https://www.bertolami.com!
Bonus: Top 5 NES Games by Sales
Sales Rank #1: Super Mario Bros. (40.2M units sold)
Sales Rank #2: Duck Hunt (28.3M units sold)
Sales Rank #3: Super Mario Bros. 3 (18M units sold)
Sales Rank #4: Super Mario Bros. 2 (7.4M units sold)
Sales Rank #5: Legend of Zelda (6.5M units sold)
(Figure: The Zapper)
Bonus: How Did the Zapper Work?
Let’s take a look at how this worked with Duck Hunt.
The player aims the Zapper at the duck and pulls the trigger.
They hear a gunshot sound and see a flash.
Frame 1: baseline
Frame 2: light up the targets
Frame 3: back to normal
(Figure: the Zapper's light sensor result for each frame, either "light detected" or "no light detected")
When there are multiple targets, the system can render each hit target (a white box) at a different point in time, so it can tell which target was hit.
The Zapper generally doesn’t work on modern displays due to the large amount of video processing delay they introduce.
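Below is a sketch of the hit test described above. It assumes one baseline frame in which the gun's aim point should show no light (for example, a darkened screen), followed by one frame per target in which only that target is drawn as a white box. The function name and the array layout are hypothetical.

```c
#include <stdbool.h>

/* light[0]: sensor reading during the baseline frame.
 * light[t + 1]: reading during the frame where only target t is lit.
 * Returns the index of the target that was hit, or -1 for a miss. */
int zapper_hit_target(const bool light[], int num_targets)
{
    if (light[0])
        return -1;       /* light with no target lit: not a valid hit */
    for (int t = 0; t < num_targets; t++) {
        if (light[t + 1])
            return t;    /* light seen only while target t was lit */
    }
    return -1;
}
```

The baseline check is what stops a player from scoring by pointing the gun at a lamp. The per-target frames are why a delayed display breaks the scheme: the sensor reading no longer lines up with the frame the console thinks is being shown.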