Norwegian University of Science and Technology

Demolicious: A General-Purpose Graphics Processing Computer

TDT4295 Computer Design Project

Authors:

Aleksander Gustaw Wasaznik
Aleksander Vognild Burkow
Kristian Fladstad Normann
Christoffer Tønnessen
Andreas Løve Selvik
Julian Ho-yin Lam
Eirik Jakobsen
Stian Jensen
Torkel Berli

November 25, 2014

Abstract

Demolicious is a general-purpose, SIMT-inspired computer, specialized for parallel rendering of computer graphics. The performance of the system scales linearly with the number of cores. In addition, it is the first system in the history of this project course to feature an HDMI output. Computer-generated graphics were chosen as the showcase application, and were successfully demonstrated on an HDMI-enabled screen. Demolicious features a multi-core architecture, designed to maximize core utilization by hiding memory latency. The system has been implemented on a PCB, using a backup-oriented design philosophy, and has passed testing and verification.

Acknowledgements

A huge thank you goes out to the following people for tremendous help during the project:

Gunnar Tufte, for guiding us towards a finished, working product.
Yaman Umuroğlu, for great project coordination and answers to questions at close to any hour of the day.
Odd Rune S. Lykkebø, for fantastic email response times, anywhere in the world.
Omega Verksted, for help with electronic circuits, and for providing endless amounts of headers at 4 a.m.
Elster's fridge, for storing our food throughout the semester.
Geir Kulia, for providing us with dinner when we were too swamped to leave the lab.

Contents

I Introduction ...... 1

1 Computer Design Project ...... 2
  1.1 Assignment ...... 2
    1.1.1 Original Assignment Text ...... 3
  1.2 A Parallel Processing Accelerator ...... 4
  1.3 Structure of the Report ...... 4

2 Modern Graphics Processing Units ...... 5
  2.1 What is a GPU ...... 5
  2.2 The GPU Market ...... 5
  2.3 An Enormous Task ...... 6
  2.4 Taking Advantage of Parallelism ...... 6
  2.5 At The Mercy of Memory ...... 7
  2.6 Shortcomings of a GPU ...... 7

3 NVIDIA's Fermi Architecture and CUDA Programming Model ...... 9
  3.1 The Fermi Architecture ...... 9
  3.2 CUDA Programming Model ...... 9
  3.3 Streaming Multiprocessors ...... 11
  3.4 Warps of Threads ...... 12
  3.5 Helpful Inspiration ...... 12

4 Demolicious ...... 13
  4.1 A Graphical Demo Machine ...... 13
  4.2 Demolicious Requirements ...... 14

II Solution ...... 15

5 System Overview ...... 16
  5.1 Logical System Structure ...... 16
  5.2 Programming Model ...... 17
    5.2.1 Executing Kernels ...... 17
    5.2.2 More Advanced Kernels ...... 19

6 CPU ...... 21
  6.1 Functionality ...... 21
  6.2 Communication with the GPU ...... 21
  6.3 Loading a Kernel ...... 22
  6.4 Running a Kernel ...... 23
  6.5 Loading parameters ...... 24
    6.5.1 Summary ...... 25

7 GPU ...... 26
  7.1 Responsibilities ...... 26
  7.2 Architecture Overview ...... 27
  7.3 Receiving a Kernel Call ...... 27
  7.4 Running a Kernel ...... 28
    7.4.1 Warps ...... 28
    7.4.2 Static Scheduling ...... 28
  7.5 Module Details ...... 31
    7.5.1 Processor Core ...... 31
    7.5.2 Spawner ...... 32
    7.5.3 Register Directory ...... 34
    7.5.4 SRAM Arbiter ...... 36
    7.5.5 Load/Store Unit ...... 36
    7.5.6 HDMI ...... 37
    7.5.7 Summary ...... 38

8 Physical Implementation ...... 39
  8.1 Backup oriented design ...... 40
  8.2 Input ...... 41
  8.3 Output ...... 41
    8.3.1 HDMI implementation ...... 41
  8.4 Power ...... 41
  8.5 Bus ...... 42
  8.6 Clocks ...... 42
  8.7 Main Components ...... 42

9 Additional Tools ...... 44
  9.1 Assembler ...... 44
  9.2 Simulator ...... 44

III Results & Discussion ...... 45

10 Testing ...... 46
  10.1 GPU Component testing ...... 46
  10.2 VHDL system integration tests ...... 47
    10.2.1 Memory Stores and Kernel Parameterization ...... 48
    10.2.2 Predicated Instruction Execution ...... 49
    10.2.3 Loads from Primary Memory ...... 50
  10.3 Verification of Communication Channels ...... 52
    10.3.1 JTAG test ...... 52
    10.3.2 EBI Bus ...... 53
    10.3.3 HDMI Output ...... 53
    10.3.4 SRAM Communication ...... 53
  10.4 PCB tests ...... 53
    10.4.1 Solder, signal and power test ...... 54
    10.4.2 Oscillator and clock test ...... 54

11 Results ...... 55
  11.1 Scalability of the Demolicious system ...... 56
  11.2 Performance ...... 56
  11.3 Video output ...... 58
  11.4 Single vs double buffering ...... 59
  11.5 Demolicious Power efficiency ...... 60

12 Discussion ...... 63
  12.1 Energy Efficiency ...... 63
    12.1.1 Optimizing FPGA power usage ...... 63
    12.1.2 Physical Implementation ...... 64
  12.2 Programming Challenges ...... 64
  12.3 SRAM - Alternative Memory Layouts ...... 65
  12.4 Multiple clock domains in HDMI unit ...... 65
  12.5 Troubleshooting and Workarounds ...... 65
  12.6 Hardware components ...... 68
    12.6.1 Clocks ...... 69
  12.7 Budget ...... 70

13 Conclusion ...... 71
  13.1 Further Work ...... 71
    13.1.1 Architecture ...... 71
    13.1.2 Physical design ...... 72
    13.1.3 Compiler ...... 72

List of Tables ...... 74

List of Figures ...... 74

IV Appendices ...... 78

A Expenses ...... 79
  A.1 PCB manufacturing ...... 80
  A.2 Component purchases ...... 81

B Instruction set architecture ...... 82
  B.1 Registers ...... 83
  B.2 Predicated instructions ...... 84
  B.3 Instructions ...... 85
    B.3.1 R-type Instructions ...... 85
    B.3.2 I-type Instructions ...... 86
    B.3.3 Pseudo Instructions ...... 86

C Commented tunnel kernel ...... 87

D PCB schematics ...... 91

Part I

Introduction

Chapter 1

Computer Design Project

This report presents the TDT4295 Computer Design Project at NTNU for the fall semester of 2014. The course is held every fall, and consists of a single task in which a group of students builds a working computer from scratch. The group behind this report consisted of nine students from the Computer Science department. Gunnar Tufte and Yaman Umuroğlu served as advisors for the group throughout the semester and assisted with administrative tasks.

1.1 Assignment

The Computer Design Project's primary tasks included making a custom printed circuit board (PCB) and implementing a custom processor architecture on an FPGA (Field-Programmable Gate Array). Together with a microcontroller (MCU) and a choice of I/O components, these were to form a complete and working system. The project is evaluated based on this report and an oral presentation of the work, as well as a prototype demonstration. The specific assignment for this year was to create a processor inspired by GPU architectures. Core requirements included having multiple processor cores and a graphical display output.


1.1.1 Original Assignment Text

1.1.1.1 Construct a graphics processing unit (GPU) inspired processor

GPUs play a large role in graphical applications as well as high performance computing. They are typically constructed around the SIMD (single instruction, multiple data) paradigm and include special hardware for accelerating graphics-related operations. The idea is to make a GPU-inspired processor architecture that exploits the possibility of parallel computation on a single chip. The GPU must be a multi-core system.

1.1.1.2 Additional requirements

Your processor will be implemented on an FPGA, and you are free to choose how to realize your computer architecture. Studying the architectures of general multi-core processors and parallel machines can be a good starting point.

Energy efficiency should be a primary consideration in all phases of the project, from early design decisions to how software is written.

The task should also include a suitable application that can produce a graphical output on a display to demonstrate the processor.

The unit must utilize a Silicon Labs EFM32 series microcontroller (to act as an I/O processor) and a Xilinx FPGA (to implement your architecture on). The budget is 10 000 NOK, which must cover components and PCB production. The unit design must adhere to the limits set by the course staff at any given time.


1.2 A Parallel Processing Accelerator

The group decided to make the custom GPU-inspired processor an accelerator for parallelizable computations. This means that it would not be designed to run entire programs on its own, but would instead be tasked with particular parallelizable parts of a program. Its purpose would primarily be to run graphics-related operations. However, it would be designed to be programmable with general arithmetic and logic operations, and would not have specialized hardware units for performing graphics operations. The group was determined to also send graphics output directly from the accelerator to the screen, as is common for a GPU. The accelerator design needed to be implemented on an FPGA. The FPGA was to be mounted on a PCB together with a microcontroller, memory, and an HDMI port, forming a graphical computer system organized similarly to a modern PC.

1.3 Structure of the Report

The accelerator processor fills the role of a modern GPU in the system presented by this report, and will therefore be referred to as the GPU.

Having introduced the problem, the report continues by giving a short introduction to what a GPU is and how it is designed. It presents concepts and challenges in graphical computation and GPU design. These topics help the reader appreciate the need for a GPU in modern computers, and understand the trade-offs and optimizations involved in its design. The system created in this project, and its purpose, will be introduced in light of modern GPU concepts. In the solution part, a detailed explanation of the system is given by roughly following the path of an executing program, going from high to low levels of abstraction. The physical product produced in this project, and the testing and verification methods involved in making it work, are also presented. The results chapter goes through which parts of the system worked and which did not, and looks at measurements of its performance. Lastly, the discussion chapter comments on some of the difficulties with the completed system, and on topics such as possible further work.

Note that the reader of this report is expected to have a basic understanding of logic design, computer components, and programming.

Chapter 2

Modern Graphics Processing Units

2.1 What is a GPU

Modern GPUs are, in a way, an evolution of former Video Graphics Array (VGA) controllers [2]. A VGA controller of the early 1990s served as a memory controller and display generator that wrote framebuffer values to a display. As technology advanced, it gained hardware to perform specific graphics-related functions. This eventually evolved into a processor, with its own memory, that incorporated a full set of graphical functions.

A GPU's primary purpose has traditionally been to offload graphical calculations from the CPU and render images to a screen. Graphical functions are accessed through APIs like DirectX and OpenGL. Today, GPUs also have general computing capabilities and may serve as co-processors for the CPU in addition to handling their graphical duties. Non-graphics applications for a GPU include image processing, video encoding, and many scientific computing problems and other large, highly regular calculations.

2.2 The GPU Market

In general, at least one GPU is present in every PC these days, and the market for PCs has been relatively stable for many years [5]. These GPUs can take the form of discrete chips or be integrated with the CPU. Intel, AMD, and NVIDIA are the big actors in the PC GPU market [14]. Intel is the largest overall, but only makes GPUs integrated with their CPUs. AMD and NVIDIA share the discrete GPU market [13]. One of NVIDIA's GPU architectures will be used as an example later. A big market for the more powerful discrete GPUs is the computer gaming industry. This market has contributed a great deal to the rapid progression of graphics technologies. With the advent of general-purpose GPUs, scientific computing has also gained an interest in powerful GPUs.


In the past six years, the booming market for mobile smart devices has introduced a new arena for GPUs. Mobile devices are currently outselling workstations, and every new mobile device sold today ships with a small-format GPU. These GPUs are generally integrated in the device's system-on-a-chip (SoC), and have slightly different design considerations than traditional workstation GPUs. As with everything else that runs on batteries, power consumption is a primary concern in this format. Traditionally, this is a concern that GPU design has not needed to prioritize. The mobile GPU format has opened the door for GPU design companies other than the three mentioned above. The three dominant GPU design companies in the SoC market are Qualcomm, ARM, and Imagination Technologies [15].

2.3 An Enormous Task

Producing computer graphics is a highly processing-intensive task. Consumers expect their computers to display videos and games in high resolutions and at high framerates. To color one pixel accurately in a 3D environment, the processor typically needs to calculate vectors in 3D space, interpolate texture data, adjust for light intensity, and more. All of these tasks require several demanding floating point operations. A CPU is optimized for serial programs, and will compute a picture frame in such a fashion. The problem is that, even running at 4 GHz, it cannot keep up with the amount of calculation that modern graphical demands require. Fortunately, pixels can often be computed independently of each other. A CPU fails to take advantage of this.

2.4 Taking Advantage of Parallelism

For a CPU, latency is of primary concern because the next instruction often depends on the result of the previous one. Therefore, the CPU needs to complete each instruction as quickly as possible. This makes the CPU very good at problems with a low level of parallelism, which after all characterizes most programs. Finishing an instruction as quickly as possible is less of a concern when problems are more parallel in nature. A GPU therefore optimizes for throughput instead of latency. It gains throughput by making its architecture highly parallel in all respects. The fast single-threaded cores in CPUs are replaced with a large quantity of smaller, slower cores. A GPU gains throughput by spending resources on execution units instead of on more cache, prediction logic, or dynamic reordering logic to make one thread very effective. It also exploits parallelism with its deep pipelines, executing many instructions concurrently within each core. These architectural decisions cause each individual thread to require more wait cycles for memory accesses or pipeline hazards. Fortunately, a GPU is not concerned about each individual thread's tardiness, and instead fills these wait cycles by interleaving many threads at once in each core. Filling every cycle

and every pipeline stage with useful work is, of course, essential in optimizing throughput.

2.5 At The Mercy of Memory

A GPU can achieve a very high throughput because of its massive amount of parallelism. But with great computing power comes great memory demand. Keeping all the computational units in a GPU supplied with enough data is a difficult challenge. Using NVIDIA's top-end GPU of 2006, the GeForce 8800, as an example: it can process 32 pixels per clock at 600 MHz [2]. A pixel will usually be 4 bytes in size. Calculating its value will typically require a read and write of color and depth, and a few texture element reads. So, on average, there may be a demand of 7 memory accesses of 4 bytes each per pixel. Sustaining 32 pixel values per clock thus requires up to 896 bytes per clock, or 537 GB/s, in memory transfers.

The memory system uses many techniques to meet such demanding requirements. First of all, it consists of many different types of memories that are optimized for different access patterns. Some data must be globally available to all threads, other data is local to one or a handful of threads. Some data is read-only, some data is accessed with high spatial locality in 3D space, and so on. Tailoring the memory system to suit important and commonly used access patterns is key to high performance. Some examples of memory types that NVIDIA uses in their GPU architectures are texture memory, constant memory, local memory, and shared memory. The different types of memories are usually organized into several separate banks, often with their own memory controllers, so that more requests can be handled at a time. Interleaving consecutive virtual addresses over different physical memory banks helps spread requests evenly across the set of banks. All memories typically have multiple levels of cache associated with them as well. Other techniques to improve memory bandwidth include data compression, to reduce the number of bits to transfer, and bundling memory accesses into sets that the memory system handles well.
Bundling adds latency to memory requests. But, as mentioned earlier, this is not an issue as the GPU has plenty of other work to do while a particular thread waits.

2.6 Shortcomings of a GPU

A GPU obviously performs its parallel calculations effectively, but this comes at a cost. It gains throughput over a CPU in part by sacrificing low latency on individual instructions. As we have seen, this is a result of deep pipelines and delayed memory requests, among other things. Such architectural features reduce the performance of serial programs, especially ones with a lot of branching. Branch instructions on a deeply pipelined processor are heavily punished by control hazards.


Since most programs are not particularly parallelizable and commonly contain branching, a GPU is insufficient for running an average program by today's standards. Computer users today are accustomed to the performance provided by the cooperation of a CPU and a GPU, a so-called heterogeneous computer system. This cooperation is so essential that no computer, or even mobile device, with a graphical display is produced today without both.

Chapter 3

NVIDIA’s Fermi Architecture and CUDA Programming Model

To explore some concepts of modern GPU architectures in detail, this chapter will take a look at an example from NVIDIA. The programming model used by NVIDIA for its modern GPUs, called CUDA, will be examined. Then, NVIDIA's first architecture to use this programming model, the Fermi architecture, will be used as the example for illustrating GPU design concepts. The system presented in this report is heavily inspired by NVIDIA, and the following concepts are relevant for both its programming model and its architecture.

3.1 The Fermi Architecture

In 2006, NVIDIA released their first GPUs with a so-called unified shader architecture. Before this, GPUs had separate, dedicated hardware for all common graphics operations. Beginning with the Fermi architecture, the majority of operations are executed on the same hardware, which is able to perform general operations. This significant difference compared to earlier GPU architectures has enabled calculations on a GPU other than graphics processing. This ability has been given the term General-Purpose GPU computing, or GPGPU.

3.2 CUDA Programming Model

To expose the computing powers of the graphics card for general applications, there was a need for a new application programming interface (API) that could be used to run code on the GPUs. It was possible to do non-graphical calculations on GPUs previously, but it required redefining any problem as a graphics


problem. This was highly impractical. With the Fermi architecture, NVIDIA released a framework for running code on their GPUs called CUDA. CUDA is an extension to the C programming language that lets the programmer write functions for execution on the GPU. Such a function is called a kernel. An example will be used to illustrate how a kernel executes on the Fermi architecture. Let's say you want to fill a frame buffer, an area of memory, with the color green. In a sequential programming model, you would typically write a loop that fills the memory locations with the value for green, one by one. See Listing 3.1.

int green = 0x00FF00;
for (int i = 0; i < nr_of_pixels; i++) {
    framebuffer[i] = green;
}

Listing 3.1: A sequential program filling the screen with green

For a CUDA kernel, on the other hand, the same program would be written as if filling only a single pixel with green. The kernel would then be executed by one thread for every pixel on the screen. Each thread would receive a unique ID number corresponding to the 'i' value from the sequential example. Below, in Listing 3.2, is an example CUDA kernel that fills a single memory location, or pixel, with green.

__global__ void fill_screen(int* framebuffer) {
    /* Calculate global id of this thread */
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    int green = 0x00FF00;
    framebuffer[global_id] = green;
}

Listing 3.2: A CUDA kernel filling a single pixel with green

The CPU will need to load a kernel onto the GPU at run time before calling it. When calling a kernel, the CPU specifies how many threads to run. Managing memory, uploading kernels, and calling kernels on the GPU is done using CUDA-specific syntax. Listing 3.3 shows example CUDA code for calling the fill_screen kernel.

int* frame_buffer_device;                            // framebuffer in GPU memory
cudaMalloc((void**)&frame_buffer_device, data_size); // allocate memory at GPU
fill_screen<<<1080, 1920>>>(frame_buffer_device);    // call kernel

Listing 3.3: Starting the CUDA kernel with one thread per pixel on a 1920x1080 screen

Running a separate thread for each small task, as in CUDA, is known by a more general term: SIMT, short for Single Instruction Multiple Threads. It is a common programming paradigm for modern GPUs because it is a very flexible way of writing parallel programs. Its programs are easily portable and scalable

because they ignore the underlying architecture of the machine.

3.3 Streaming Multiprocessors

Figure 3.3.1: Streaming Multiprocessor

A CUDA function typically creates many, many threads. To handle all of these, the Fermi architecture has a large number of cores organized into a hierarchy of groups that share different amounts of extra resources. At its heart lie 16 Streaming Multiprocessors (SMs).

A Streaming Multiprocessor is a collection of tightly coupled Streaming Processor (SP) cores along with some shared accessories. To enable a large number of cores, NVIDIA makes each SP core simple. For example, common but difficult functions like sine and reciprocal are delegated to a few separate hardware units within the SM, called Special Function Units (SFUs). In Fermi, each SM contains 32 SP cores. Each core is capable of both integer and floating point operations. The cores' shared accessories include four Special Function Units, 16 Load/Store units, and 64 KB of SRAM for shared memory and shared cache. A simplified diagram of a smaller SM is shown in Figure 3.3.1.


Figure 3.3.2: Warps of 32 threads execute

3.4 Warps of Threads

Threads in Fermi are grouped together in logical units of up to 32 threads. This corresponds to the 32 Streaming Processor cores in a Streaming Multiprocessor. The 32 threads are collectively called a warp. Threads in the same warp always execute the same instruction at the same time. This way, only a single instruction fetch operation per warp is required for the SM. Many warps may be active in an SM at any time, and each thread in each warp has its own set of registers within the SM. This enables fine-grained interleaving of warps without needing expensive context switches on every warp switch. A simplified example execution order is shown in Figure 3.3.2. Scheduling warps is difficult, but it is important for filling the pipeline with useful instructions, as discussed in chapter 2.

As noted, all threads in the same warp run the same instruction. After a conditional statement, however, different threads of the same warp will likely need to perform different instructions. This is the concept of thread divergence. NVIDIA's GPUs support dynamic thread divergence by moving threads between warps as they diverge.
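To make the warp grouping concrete, here is a small C sketch (our illustration, not NVIDIA code): a launch of N threads occupies ceil(N/32) warps, and the SM performs one instruction fetch per warp rather than one per thread.

```c
/* Illustrative only: threads are issued in fixed groups (warps) of 32. */
#define WARP_SIZE 32

/* Number of warps needed to cover n_threads, rounding up so a partially
   filled last warp still counts as a whole warp. */
int warps_needed(int n_threads) {
    return (n_threads + WARP_SIZE - 1) / WARP_SIZE;
}
```

For example, launching 4096 threads occupies 128 warps, while 33 threads still occupy 2 warps, one of them almost empty.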

3.5 Helpful Inspiration

The concepts employed in NVIDIA's Fermi architecture and CUDA programming model are, needless to say, effective for parallel calculations. They have been a source of inspiration for this project. Most of these concepts, like kernels, warps, and inexpensive context switches, will return later in the report. Although they return in simplified versions, they are helpful for exploiting parallelism.

Chapter 4

Demolicious

4.1 A Graphical Demo Machine

The system created in this project is named Demolicious. It is made for running graphical demos, which is the inspiration for its name. A graphical demo is a visually pleasing programming feat made to demonstrate the capabilities of the computer as well as the programmer.

Demolicious is inspired by modern PCs' CPU-GPU coordination, both in its programming model and in its architectural design. The CPU handles programs by default and offloads parallelizable tasks to the GPU; it runs on a microcontroller. The GPU handles parallelizable tasks, like the many graphical operations in a demo, and sends graphical data to the screen; its architecture is designed and implemented on an FPGA.

Modern GPUs have long development cycles and are very complex, so Demolicious' GPU architecture is necessarily a greatly simplified version. Because of time constraints, many features that define modern GPUs had to be left out of its design: the GPU has no branching, and there are no caches. Modern GPUs use dynamic scheduling to better utilize their resources; the Demolicious GPU instead uses barrel processing, a static scheduling scheme, to hide memory latency.


4.2 Demolicious Requirements

For Demolicious to support interesting demos, a set of basic requirements was listed for it. It certainly needed to display graphics, and a goal was to use HDMI for this. This decision was based partially on the excitement of doing something that had not been done in earlier Computer Design Projects, and partially on the fact that HDMI is a modern technology. Driving video output from the custom GPU was important because a key point of the design was for it to mirror a modern GPU's purpose in a computer system.

Energy efficiency needed to be a primary concern throughout this project. In an experimental system like Demolicious, that concern will often be sidelined in favor of making anything work in the first place; energy efficiency is, after all, an optimization. Design philosophies like simplicity in the GPU architecture and redundancy in PCB options collide somewhat with energy concerns, and both were essential in producing a working system in the short timespan of this project. One very significant energy optimization for Demolicious did make the list of functional goals: setting the CPU to sleep mode when it is idle.

The performance goals were set to be challenging but still reasonable, based on a very rough estimate of potential memory bandwidth and FPGA clock frequency. Writing interesting demos for the completed machine would be a bonus. A related bonus would be writing helpful tools to assist in writing the demos. Table 4.2.1 below lists the initial functional goals, together with priorities, that were decided upon for Demolicious.

Demolicious Functional Goals                                            Priority
Demolicious should be general purpose                                   HIGH
Demolicious should display graphics on screen                           HIGH
Demolicious should let its CPU sleep while the GPU executes             MEDIUM
Demolicious should use HDMI for its graphics output                     MEDIUM
Demolicious should drive video output from its GPU                      MEDIUM
Demolicious should display a frame of 512 by 256 pixels                 MEDIUM
Demolicious should handle an output rate of about 30 frames per second  MEDIUM
Demolicious should have an example application in the form of a visual
demo displayed on screen                                                LOW
Demolicious should have a toolchain to make life easier for programmers LOW

Table 4.2.1: Goals set for the Demolicious system

Part II

Solution

Chapter 5

System Overview

5.1 Logical System Structure

Figure 5.1.1: Logical overview of the system.

A conceptual overview of the Demolicious computer is depicted in figure 5.1.1. On a large scale, it is made up of a CPU and a GPU. In addition, an external memory chip is used for storing framebuffers for screen output. A dedicated HDMI module reads pixel data from the framebuffer and displays it on screen.


5.2 Programming Model

Figure 5.2.1: Relationship between CPU and GPU code.

The programming model for Demolicious is heavily inspired by CUDA. So let's see how we can implement the same program that fills a framebuffer with green, as we did in the CUDA introduction in section 3.2. Below, in listing 5.1, the fill_screen kernel has been converted to the Demolicious programming paradigm. The first thing you'll notice is that it is not written in C, but in assembly. A complete introduction to the Demolicious instruction set and assembly programming can be found in appendix B.

ldi $data, 0b0000011111100000
mv $address_lo, $id_lo
mv $address_hi, $id_hi
sw
thread_finished

Listing 5.1: A simple kernel that fills the screen with the color green

Let’s walk through the kernel one line at a time. The first line uses the ldi instruction, which stands for load immediate. It loads the value 0b0000011111100000, which corresponds to the color green in the Demolicious color space, into the special register $data. The second and third lines move the kernel's thread ID into the address registers. Finally, the store instruction is executed, storing the value in the $data register to the address given by the two address registers. This means that all pixels from address zero up to the number of executed threads get colored green. The thread stops running after executing the thread_finished instruction.
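The net effect of this kernel can be modeled in plain C (our sketch of the semantics, not Demolicious code; the constant 0x07E0 is simply the binary value 0b0000011111100000 from the listing, which the report calls green in the Demolicious color space):

```c
#include <stdint.h>

#define GREEN 0x07E0u  /* 0b0000011111100000, green in the Demolicious color space */

/* One loop iteration models one GPU thread: thread `id` stores green at
   framebuffer address `id`. On the GPU, these threads run in parallel. */
void fill_screen_model(uint16_t *framebuffer, int n_threads) {
    for (int id = 0; id < n_threads; id++) {
        framebuffer[id] = GREEN;  /* mv address <- id, then sw */
    }
}
```

Spawning one thread per pixel therefore colors addresses 0 through n_threads - 1, and no others.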

5.2.1 Executing Kernels

Now, how do we execute this kernel on the GPU? The program running on the CPU, referred to as the host program, has to upload the assembled kernel to


the GPU, and then tell the GPU to run it. Kernels can be assembled using the provided assembler (section 9.1).

instruction_t fill_screen_kernel[] = {
    0x08050000, // ldc $data, 0
    0x00402004, // mv $address_lo, $id_lo
    0x00201804, // mv $address_hi, $id_hi
    0x10000000, // sw
    0x40000000  // thread_finished
};

kernel_t fill_screen = load_kernel(fill_screen_kernel);

run_kernel(fill_screen, 4096);

Listing 5.2: Loading and executing a kernel

In listing 5.2, we first see the assembled kernel stored in the fill_screen_kernel array. It is then uploaded to the GPU using the load_kernel function, which returns a reference to the uploaded kernel to be used when starting it. In addition, the run_kernel function is provided with the number of threads to spawn. Here, we spawn 4096 threads, enough to color 64x64 pixels green.

While being able to make the screen green by running a specialized kernel is nice, coloring the screen in different colors would require many similar kernels. To improve on this, kernels can take parameters as input, which lets them be reused with varying output. The CPU can set these parameters to different values each time a kernel is run. For instance, the CPU can set the desired color as a parameter, and the kernel will store that value to memory instead of a predefined immediate. Listing 5.3 shows an example kernel where the color is read from a parameter.

ldc $data, 0
mv $address_lo, $id_lo
mv $address_hi, $id_hi
sw
thread_finished

Listing 5.3: A kernel loading the color value from a parameter

The only changed line in this kernel is the first one: instead of using an immediate value, we load a value using the ldc (load constant) instruction. The value is a constant from the kernel’s viewpoint, as it cannot be changed from the GPU. Instead, the value is set from the CPU using the load_constant function, as seen in listing 5.4.

load_constant(0, 0x001F);
run_kernel(fill_screen, 4096);

Listing 5.4: Now drawing a blue screen using parameters


5.2.2 More Advanced Kernels

The instruction set available to kernels is fairly limited. Most notably, the control flow in kernels is linear, meaning they cannot do any branches or jumps. See section for more detail. Although the kernels don’t support diverging control flow, conditional execution is accomplished through predicated instructions. Each of the instructions in the instruction set (except for loads and stores) can be executed conditionally by prefixing it with ’?’. Whether a conditional instruction is executed is controlled by a dedicated mask register. The programmer may use arithmetic and logic operations to manipulate this register (such as the srl instruction on line 3 in listing 5.5). The result of a predicated instruction is discarded if the mask register is 1.

1 ldc $10, 0            ; Load color one
2 ldc $11, 1            ; Load color two
3 srl $mask, $id_lo, 6  ; Shift to the right converts ID to y pos
4 mv $data, $10
5 ? mv $data, $11       ; Will only be executed every other row
6 mv $address_lo, $id_lo
7 mv $address_hi, $id_hi
8 sw
9 thread_finished

Listing 5.5: Conditional execution using predicated instructions

The kernel starts by loading two color parameters. It then stores a shifted thread ID into the mask register. Shifting the thread ID to the right is a trick to convert the ID to a y value, which works when the screen width is a power of two. The mask register is only 1 bit wide, and stores just the least significant bit written to it. This means that masking is enabled for odd rows and disabled for even rows. Line 4 first writes a color value to the $data register. When masking is disabled, this value is overwritten on line 5. The kernel finishes by writing the data value to memory.

This section contains some simplifications: some nop instructions that are required under specific circumstances have been omitted. The actual kernel code that is sent to the GPU for the kernel that fills the screen with green can be seen in listing 5.6. The kernel from listing 5.5, however, is executed as is. The reason for these nop instructions will be explained in later chapters.

ldi $data, 0b0000011111100000
mv $address_lo, $id_lo
mv $address_hi, $id_hi
sw
nop
nop
nop
thread_finished

Listing 5.6: The green-screen kernel as it is actually executed


As illustrated by these examples, the kernels on Demolicious are more limited than CUDA kernels, but the programming models are very similar. With these examples in mind, it’s time to take a look behind the scenes and see how the CPU and GPU actually run this code.

Chapter 6

CPU

Now that we have seen how to write a kernel that colors the screen with a beautiful green color, it is time to see what actually happens on the Demolicious system during its execution. Our journey starts in the main processor of the Demolicious computer, namely the CPU, which is implemented on a Silicon Labs Giant Gecko microcontroller. In this chapter we will follow the kernel from load to execution and explain what happens behind the scenes on the CPU, the component of Demolicious that runs the C code seen in the previous chapter.

6.1 Functionality

The main tasks of the CPU are summarized below:
• Load kernels into instruction memory on the GPU.
• Load kernel parameters into the GPU’s constant memory.
• Load data sets into external memory.
• Start the execution of a kernel.
• Read data back from external memory.

6.2 Communication with the GPU

The backbone of the interaction between the CPU and the GPU is the wide bus connecting them. Since the GPU is designed to increase the performance of the system by parallelizing operations, it’s important that the bus does not introduce too much overhead. We therefore designed a parallel bus based on the EBI (External Bus Interface) specifications provided in the reference manual for the EFM32GG[8, p.175]. The bus has a 16-bit data line, and a 20-bit address line, as well as six control signals.


All the pins connected between the MCU and the FPGA follow the specifications from the EFM32GG’s Reference Manual[8, p.175]. This allows us to utilize the built-in support for EBI, which memory maps the whole bus on the microcontroller, and take advantage of utility functions in the emlib software library[10]. The mapping of addresses can be seen in figure 6.2.1. The CPU can interface three different memories for different purposes:
1. Constant memory, for storing constants which may be read by kernels.
2. Instruction memory, for storing the kernels.
3. Data memory, for storing the framebuffer, in addition to other data for kernels.

0x80000000  Constant memory
0x80040000  Instruction memory
0x80080000  Kernel start area
0x80100000  X
0x84000000  External memory

Figure 6.2.1: Overview of memory mapped address spaces on the CPU.

Writing to all of these memories is done by writing to the memory addresses outlined in figure 6.2.1, in the same way as writing data locally on the microcontroller. Data can also be read back from the external GPU memory. Even the task of starting kernels is done the same way, so that the same bus can be used for all GPU communication.

6.3 Loading a Kernel

As seen in the Programming Model chapter, host programs can upload kernels to the GPU. They should preferably do so at the beginning of the program, to avoid delays during execution. The CPU’s Demolicious library has a built-in function for this called load_kernel, seen in listing 6.1. It takes an array of assembled instructions as a parameter, allocates memory on the GPU and uploads the kernel, before returning a reference to the start of the kernel in GPU memory. This reference is used when starting kernels.

kernel_t fill_screen = load_kernel(fill_screen_kernel);

Listing 6.1: A load_kernel function call with the fill_screen kernel


Figure 6.3.1

Seeing as instructions are 32 bits wide and the data bus is only 16 bits wide, every instruction load must be split into two writes.

6.4 Running a Kernel

To run a kernel, the host program executes the run_kernel function with a reference to the kernel address, as well as the number of threads it wants to spawn. Listing 6.2 shows the code required for running a kernel.

run_kernel(fill_screen, 4096);

Listing 6.2: Running a kernel

Figure 6.4.1: Starting a kernel on the GPU.

To tell the GPU to start executing a kernel, there are two pieces of information the GPU needs: which kernel should be started, and how many threads should be executed. As can be seen in figure 6.2.1, there exists a dedicated address space for starting kernels. It is the same size as the instruction memory, meaning any valid instruction address is also valid in the start kernel space. Writing to the instruction address, but in the start kernel space instead of the instruction memory, tells the GPU to start executing from that particular instruction.


When writing to this start kernel address, the data value tells the GPU how many threads should be spawned. For example, say the fill_screen reference points to instruction 100 in instruction memory. Then, executing the code in listing 6.2 will write the value 4096 to address 100 in the start kernel area.

Actually, writing the number of threads to spawn would limit the maximum number of threads to 2^16, as the data bus is 16 bits wide. This will not suffice for our target resolution of 512*256 when running one thread per pixel, as that requires 2^17 threads. Instead, the number of batches of threads is written to the GPU. A batch represents the number of threads which are active in the GPU at the same time, typically between 8 and 32 depending on the size of the GPU. This work-around increases the number of threads that can be spawned from 2^16 up to 2^21.

Only one kernel can execute at a time on the GPU. Because of this, the run_kernel call is designed to block until the kernel has completed execution. This is implemented by having a dedicated signal from the GPU to the CPU which is asserted whenever the GPU is idle. The CPU can sleep while the GPU is executing and wait for an interrupt on this signal, saving energy while waiting for the GPU.

6.5 Loading parameters

The last important feature of the CPU library is support for setting kernel parameters. As seen in listing 5.4, this is implemented in the load_constant function. The call takes an address and a value as parameters, as can be seen in listing 6.3.

load_constant(0, 0x07E0);

Listing 6.3: Setting a kernel parameter

Figure 6.5.1: CPU-GPU interaction when loading a parameter.

The load_constant function is simply implemented by writing the constant value to the requested address, as seen in figure 6.5.1.


6.5.1 Summary

We have now seen how our kernel is bootstrapped and executed. The assembled kernel is uploaded to the GPU which is then given a command to run said kernel with one thread per pixel. Kernels can also read parameters which the host program can easily vary between kernel runs.

Chapter 7

GPU

We know how to write a kernel, and how to start it. Let’s dive into the heart of Demolicious: the GPU! In this chapter we follow our kernel all the way from the moment the GPU receives commands from the CPU, through the execution of all the threads, into memory, and finally to the screen over HDMI. The Demolicious GPU is implemented on a Spartan-6 FPGA, a programmable hardware chip. The architecture has been designed, sketches drawn, and lastly implemented in VHDL, a hardware description language.

7.1 Responsibilities

The GPU has the following responsibilities:
1. Receive instructions and constants from the CPU
2. Handle kernel invocations from the CPU
3. Write results to external SRAM
4. Assert the ’computation finished’ signal to the CPU


7.2 Architecture Overview

Figure 7.2.1: A high level overview of the GPU.

Figure 7.2.1 presents a high level overview of the GPU. The CPU issues commands to the communication unit in the GPU. Instructions are fetched from the instruction memory and decoded by the control unit, which has the responsibility of setting the control signals for each instruction. The control signals go to all cores of the GPU. The processor cores access memory through the load/store unit, and get constants from the constant memory. The video unit reads pixels from the data memory and outputs them to a screen over HDMI.

7.3 Receiving a Kernel Call

Figure 7.3.1: Launching a kernel from the GPU’s viewpoint.


The communication unit is responsible for receiving kernel call requests from the CPU. A kernel call consists of the address of the kernel and the number of threads to launch. When a kernel call is received, the kernel launch signal is asserted. The kernel launch signals are forwarded to the thread spawner, which writes the kernel start address to the PC register and starts distributing thread IDs to the processor cores. After holding the kernel launch signals high, the communication unit has completed its role in launching the kernel. When the kernel completes executing, the thread spawner asserts the kernel done signal, and the communication unit forwards the signal to the CPU, indicating that the kernel call has completed.

7.4 Running a Kernel

7.4.1 Warps

As in NVIDIA’s Fermi architecture (see chapter 3), Demolicious organizes threads into warps. In Demolicious, the warp size is the same as the number of processor cores; for versions with fewer cores, the warp size changes accordingly. Just as in Fermi, every thread in a warp always executes the same instruction at the same time, with one thread in each core. Note that while Fermi has multiple Streaming Multiprocessors which can each execute warps concurrently, Demolicious can only execute a single warp at a time. The Demolicious system is analogous to a single Fermi Streaming Multiprocessor. An important difference between Demolicious and Fermi is that Demolicious kernels have no jumps, so its threads never diverge. This significantly reduces the complexity of scheduling these warps, and lets us get away with a single program counter.

7.4.2 Static Scheduling

As every thread in a warp always executes the same instruction, our system needs to handle 8 memory requests being issued during the same cycle. The memory system for Demolicious consists of two SRAM chips, and can return two words every cycle. This bandwidth is not sufficient to satisfy the 8 memory requests produced by a warp without stalling. Occasionally, these stalls can be hidden by the programmer if loads are issued ahead of their usage. However, when the program does not have enough useful instructions to execute while waiting for loads, the GPU must stall.

Stalling a throughput-oriented machine is obviously unfortunate. But there are many warps waiting for execution at (almost) any time. If these warps could execute while other warps are waiting for memory, we could utilize the system better. However, changing between executing warps requires a context switch: the thread put on hold has to store all its register values somewhere, so that they are available when it resumes execution.


Figure 7.4.1: Execution timing of warps without jagged scheduling

In software multi-threading, a context switch is typically achieved by storing all registers to memory and loading in the registers of the newly scheduled thread. This is an expensive operation, and would introduce more latency than we are trying to hide. Demolicious, however, has a set of active warps that all have their own registers, so a context switch can be carried out with virtually no overhead, at the expense of some additional hardware to store all these extra registers.

To keep the architecture as simple as possible, Demolicious employs a simple static scheduling algorithm: the active warps are simply rotated after every instruction. In figure 7.4.1 we revisit our green-screen kernel and see how the warps are scheduled. The number in the upper right corner of each square is the GPU cycle, and we refer to the rows as barrel lines. During cycle 0, instruction A is executed on warp 0. The next cycle, instruction A is executed on warp 1. When we get to cycle 4, warp 0 is scheduled again, this time executing instruction B. This continues until we reach clock cycles 16-19, where warps 0 through 3 execute instruction E, the thread_finished instruction, which does no computation, but allows the GPU to set up new warps. On cycle 20, warp 4 is set up and ready to execute the first instruction. This pattern continues until all threads have run to completion.

However, there is an issue with this simple scheduling algorithm. Let’s see what happens when we get to instruction D, marked in red. In clock cycle 12, all threads in warp 0 execute a memory request, resulting in 8 simultaneous memory requests. As our memory system handles only two requests per clock cycle, it is busy for the next 4 cycles with these requests. When warp 0 gets to execute its next instruction, the memory operation is complete for all the threads in the warp. Had this been a load instead of a store, the loaded value could already have been used by its next instruction, and we would have avoided a stall. So far, so good.

However, when warp 1 executes the store instruction in cycle 13, the memory is already busy with the requests from warp 0. The memory requests from warp 1 will not be started before cycle 16, and will therefore not be complete before its next instruction. It is even worse for warps 2 and 3. As all warps execute the same code, this can introduce up to 8 nop instructions for every warp.


Figure 7.4.2: Execution timing of warps with jagged scheduling

Clearly, the introduction of this barrel processing technique did not by itself solve our problems. Fortunately, this was a simplified version of the scheduling in Demolicious. The actual scheduling is shown in figure 7.4.2. By introducing an offset in the instructions executed, henceforth called jagged execution (or "jaktstart", as we like to call it in Norwegian), we can avoid this issue. With this scheduling, there are 4 cycles without a memory operation between every memory operation. This holds as long as memory operations have 4 non-memory operations between them in the kernel; it is up to the programmer or assembler to ensure this. This warp scheduling algorithm guarantees that nops are not required after loads, as long as they are spaced far enough apart.


7.5 Module Details

7.5.1 Processor Core

Figure 7.5.1: A single processor core.

Each processor core contains an ALU and a register directory. On each clock cycle the ALU receives the selected values from the register directory. On the same cycle, the result is written back to the register selected by the control unit.

If the current instruction is a load constant instruction, the control unit selects which constant to fetch from constant memory. The constant enable signal is asserted, causing the multiplexer to select the constant value as the write back value. To perform immediate instructions, the control unit asserts the immediate enable signal. The immediate value is then selected by the multiplexer and used as an operand in the ALU.

The architecture contains multiple processor cores like the one in figure 7.5.1. Each of the cores receives the same control signals for the instructions.


7.5.2 Thread Spawner

Figure 7.5.2: Thread spawner and neighboring components.


Figure 7.5.3: Thread spawner RTL implementation.

The thread spawner is responsible for overseeing kernel execution. It spawns threads whenever necessary, handling thread setup and ensuring that the requested kernel is executed. When all threads have finished executing, it asserts the kernel done signal, notifying the host program that computation has finished.

When a kernel invocation request is received, the thread spawner stores the provided base address of the kernel and the number of threads to spawn, and sets the next thread ID register to zero. Threads are spawned one warp at a time into the currently active barrel line. Therefore, on kernel start the thread spawner will spawn threads until the warp scheduler has made an entire rotation, or a barrel roll as we call it. When a thread finishes execution, the control unit asserts the finished signal to the thread spawner. If there are threads left to spawn, the currently active barrel line is filled with a new warp of threads.

But what does spawning a warp of threads actually entail? All threads need a unique thread ID. If the value of the next thread ID register is 4, and we have 4 processor cores, core zero will write 4 to its thread ID register, core one 5, and so forth. The next thread ID register is then incremented by the warp size, increasing it from 4 to 8.


7.5.3 Register Directory

Figure 7.5.4: The register directory, and its neighboring components.

There is one register directory per processor core. Each register directory contains one register file per barrel line. The register files include seven dedicated registers and nine general purpose registers, listed in table 7.5.1.


Register number   Description        RW
$0                Zero register      Read-only
$1                ID High            Read/Write
$2                ID Low             Read/Write
$3                Address High       Read/Write
$4                Address Low        Read/Write
$5                LSU data           Read/Write
$6                Masking register   Write
$7 - $15          General purpose    Read/Write

Table 7.5.1: The registers contained within the register files.

Individual register files are selected using the register file select signal. Using this signal, the warp scheduler selects the active register file. Within the register directory, signals are routed to the active register file using the select signal. Consequently, from the processor core’s point of view, there is just one register file.

The dedicated registers have special functions in the architecture. With the exception of registers $0 and $6, the dedicated registers may also be used as general purpose registers. The masking register is used to enable conditional execution, as introduced in section 5.2.2. When executing predicated instructions, the masking register is used to disable writes to the register file, making the instruction have no effect. The physical implementation is seen in figure 7.5.5. To reduce the complexity of the architecture, store word instructions cannot be predicated. Instead, the programmer has to manage the values written explicitly.

Figure 7.5.5: Using the mask bit to disable register writes.

Data memory addresses are 20 bits wide, and thread IDs are 19 bits wide. As the word size is only 16 bits, these values are split across two registers.

The load/store unit (LSU) has a separate signal for selecting which register file to write to. Using this signal, the LSU can write data loaded from memory into any register file on its own.

7.5.4 SRAM Arbiter

The SRAM arbiter connects the LSU, HDMI unit and CPU to the SRAM. As multiple units may request access to SRAM at the same time, and the Demolicious system is using an interlaced memory model, dedicated access to SRAM cannot be guaranteed. Therefore, a prioritization scheme is required. The CPU has the highest read-write priority; as the host program by convention should not access memory while kernels are executing, this is fine. The LSU comes in second, as it needs uninterrupted access to SRAM during execution. If the HDMI unit could prevent the LSU from accessing memory, the result of a kernel execution could differ based on when the HDMI unit chooses to read the current framebuffer. Therefore the HDMI unit has the lowest priority, having to scavenge pixels whenever it can.

7.5.5 Load/Store Unit

Figure 7.5.6: RTL of the load/store unit. Notice the dual queueing system to handle the interlaced SRAM.

A load/store unit, shown in figure 7.5.6, is responsible for handling memory requests from the processor cores. The load/store unit of Demolicious must be capable of servicing incoming requests from all the processor cores simultaneously. Requests are carried out asynchronously in the background, and replies are delivered directly into the appropriate registers of the threads that made the request.

Demolicious has two independent memory banks, each capable of reading or writing one word every cycle. To maximize memory throughput, a word-striping scheme is used: the lowest bit of the memory address determines which bank holds that location. Because there are two independent memory banks, incoming requests are routed to two separate queues, one for each memory. Each queue then feeds requests to its associated SRAM chip. Read responses are handed to the write-back unit, which delivers the data to the appropriate register file.

The queues are necessary because Demolicious has more processor cores than memory banks, so all the requests from a warp cannot be completed in a single cycle, as discussed in section 7.4.1. The processor cores work in tandem and issue requests simultaneously, so it takes several cycles to complete the requests for a single warp. Since there is a limit to how many requests can be in the queue at any time, there is also a limit to how often requests can be issued.

Figure 7.5.7: The logical overview of the video unit. A prefetching unit tries to fill the buffer when it is not full. The Accepted signal signifies a successful memory request. Across the clock domain boundary, the video timing unit generates control signals. When pixel data should be transferred, the Send signal moves pixels from the buffer to the transmission units. The 16-bit pixel data is converted to three 8-bit colors, TMDS encoded, serialized and output using differential signaling. Should the buffer go empty, the prefetch unit needs to immediately advance to the next pixel to stay synchronized with the timing unit.

7.5.6 HDMI

From a demo maker’s perspective, a GPU without a video output is commonly known as a space heater. Demolicious uses HDMI for its video output, making it easy to connect to any recent video display.

HDMI is a streaming protocol; the receiver reads data from the cable at a fixed rate. In Demolicious, the GPU has priority access to the memory. This means that the video unit may not have access to the framebuffer (which lies in memory) when it is time to send a pixel. To alleviate this issue, as much of the framebuffer as possible is prefetched into a buffer whenever memory is available. Should the buffer underflow, black pixels are sent instead; if nothing were sent, pixels would fall out of sync with the position they should appear at on the screen.

Control signals indicate where in the data stream a new frame of video starts and ends. These allow the receiver to determine the resolution and refresh rate of the video. The lowest resolution supported by HDMI is 640x480. As this is larger than our framebuffers (64 x 64 pixels), a letterbox is added around the picture. For debugging purposes, the letterbox consists of a low-contrast checker pattern.

To actually send the data over HDMI, control signals and pixel data are split into three channels. They are then encoded using a scheme known as TMDS, whose purpose is to minimize the effect of noise over the physical connection. TMDS uses 10 bits to encode either an 8-bit color value when sending an image, or control values when not. Demolicious uses a 16-bit word size, so colors are represented with 5 bits for red, 6 for green and 5 for blue. These are resized to 8-bit values using a scheme that allows for both complete black and white colors. Each channel is then serialized before being output together with a clock using differential signaling.

Finally, to avoid a visual artifact known as screen tearing, a technique known as V-sync with double buffering is used. This ensures that only complete frames of video are output, increasing visual fidelity.

7.5.7 Summary

The journey of our kernel is complete. We have followed it all the way from the initial load_kernel call to the screen. It has traveled through the massively parallel GPU, which can handle a vast amount of threads, using a round-robin static scheduling technique called barrel processing.

Chapter 8

Physical Implementation

Figure 8.0.1: Completed PCB

The entire Demolicious system is implemented on a PCB (Printed Circuit Board). This chapter will give an overview of the hardware architecture, and detail design choices made during the realization of the design.


8.1 Backup oriented design

Verifying a PCB design is difficult. If the circuitry has been designed incorrectly, there may not be much that can be done, except for designing a new PCB. Because of this, the PCB has backup solutions for all major components, and exposes as much as possible of the important circuitry on headers. This enables us to probe signals for debugging purposes, and also do manual rewiring in case of faulty wiring.

Figure 8.1.1: Conceptual overview of the PCB. Green boxes are main solutions. Blue boxes are backup plans. Gray boxes labeled "PINS" mean that these signals are exposed on headers.

These backup plans are in place to make sure the board will work even if some parts are broken. That way, each component can be connected to other sources than those on the board alone. Because of this, the board is not optimized for the smallest possible size, but rather for the highest possible chance of success.


8.2 Input

The main method of communication between the microcontroller and a host PC is USB. The USB circuitry is designed with ESD protection in an effort to lessen the risk of frying the PCB. Should the USB fail, a serial port (RS-232) has been implemented as a backup solution. If this also fails, the wires from the serial port are put on headers, which can be used as GPIO pins. If all else fails, the program to be run on the machine can be included when programming the MCU through the programming interface.

8.3 Output

The main source of output from the PCB is an HDMI connector. This is a novel feature this year, as no previous group has tried to implement it before. Because of this new challenge, a lot of backup schemes were put in place. The HDMI connector is put on headers, in case the connector fails. A VGA module is connected to the FPGA in case the HDMI does not work, and if this fails, a separate VGA module is connected to the microcontroller.

8.3.1 HDMI implementation

The general HDMI specification consists of Transition Minimized Differential Signaling (TMDS) and some additional wires for details regarding the transmitted signal [4]. However, we were able to produce a video feed on a screen from an FPGA using only the 8 TMDS wires, by cutting up an HDMI cable and using only those wires. Upon seeing that this was feasible, we decided to build the HDMI hardware with a TMDS connection only, going into the FPGA. The resulting hardware was then an HDMI type-A receptacle footprint, the HDMI receptacle, and a header between it and the FPGA. The reason for adding the header was that this setup is equivalent to the aforementioned prototype version: if the HDMI output didn’t work, we could connect the terminated HDMI cable onto the header.

8.4 Power

The system is powered by a 5V mini USB connector. This connection powers two voltage regulator circuits: the 1.2V rail that powers the FPGA (excluding the I/O banks), and the 3.3V power plane that powers the rest of the machine. The 5V USB connection has a backup solution in the form of a header through which it passes immediately after entering the PCB. This way, the incoming power can easily be probed for voltage level, and in case the USB connection fails, an external power supply can be connected to the board.


Figure 8.4.1: Picture of the physical power circuit, where P5 is the header on which an external power source can be connected

8.5 Bus

The bus between the MCU and FPGA is an EBI connection. The EFM32GG990 has dedicated pins for the EBI, which can be found in the datasheet for the MCU [9]. Headers are placed between the MCU and FPGA in case of failure on either side of the connection, and also provide an easy way to check transmitted signals during debugging.

8.6 Clocks

Both the MCU and the FPGA need an external clock. According to design consideration AN0002 [7], the MCU needs two crystals: a low frequency crystal at 32.768kHz and a high frequency crystal at 48MHz. For the FPGA, an oscillator of 120MHz was chosen. Headers are placed between the oscillator and the FPGA; they serve as a backup in case an external oscillator is needed, and act as a debugging tool. The microcontroller clocks, on the other hand, have a backup solution in that the microcontroller has internal RC oscillators for use in case the crystals malfunction.

8.7 Main Components

MCU (EFM32GG990F512-BGA112)

The EFM32 Giant Gecko microcontroller from Energy Micro, now Silicon Labs, was chosen as the microcontroller for this project. One aspect of the task at hand is energy efficiency, and this microcontroller is particularly strong in that regard. It is also a proven microcontroller with plenty of development boards available to us, so it seemed like a safe choice.


FPGA (XC6SLX45-2CSG324I)

An FPGA from the Xilinx Spartan-6 family was chosen. This particular FPGA has been used for various tasks at the university before, so the support systems were already available to us. A less powerful version of the same chip was available for testing on development boards in the lab.

SRAM (AS7C38098A)

The AS7C38098A SRAM was chosen as the memory for this project [6]. This particular SRAM was chosen because, to output graphical demos, we need to store framebuffers, i.e. the pixel data to be displayed on screen. The memory must be large enough to hold two framebuffers, with latency low enough to read and write data at an acceptable rate. This specific SRAM met the requirements with a capacity of 512K words by 16 bits and a 10 ns access time.
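As a sanity check on these figures, the capacity requirement can be written out. The assumption of one 16-bit pixel per word matches the 16-bit color values used elsewhere in this report:

```python
# Capacity check for the AS7C38098A: 512K x 16-bit words per chip, and two
# 512 x 256 framebuffers with (assumed) one 16-bit pixel per word.
WORDS_PER_CHIP = 512 * 1024       # 524288 words
FRAMEBUFFER_WORDS = 512 * 256     # 131072 pixels, one word each

def framebuffers_fit(n: int) -> bool:
    """True if n framebuffers fit within a single SRAM chip."""
    return n * FRAMEBUFFER_WORDS <= WORDS_PER_CHIP

print(framebuffers_fit(2))   # True: double buffering fits with room to spare
```

Two framebuffers use 262144 of the 524288 available words, leaving half the chip free.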

Chapter 9

Additional Tools

9.1 Assembler

An assembler was implemented so that kernels written for Demolicious can be run on the GPU. The assembler is written in Python. In addition to supporting assembly of all instructions implemented by the GPU, it provides additional pseudo instructions, as well as aliases for the special purpose registers. The assembler is available on GitHub at https://github.com/dmpro2014/simulator/.
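The register aliasing and pseudo-instruction support can be illustrated with a minimal sketch. The alias table and the expansion rule below are invented for illustration only and do not necessarily match the real assembler in the repository:

```python
# Hypothetical sketch of special-register aliasing and pseudo-instruction
# expansion, in the spirit of the Demolicious assembler. The alias mapping
# and the pseudo instruction are assumptions, not the real implementation.
ALIASES = {"$data": "$1", "$address_lo": "$2", "$address_hi": "$3",
           "$mask": "$4", "$id_lo": "$5", "$id_hi": "$6"}

def expand(line: str) -> list[str]:
    """Resolve register aliases and expand pseudo instructions."""
    for alias, reg in ALIASES.items():
        line = line.replace(alias, reg)
    op = line.split()[0] if line.split() else ""
    if op == "nop2":                        # hypothetical pseudo instruction
        return ["nop", "nop"]
    return [line]

print(expand("mv $address_lo, $id_lo"))     # ['mv $2, $5']
```

A pass like this runs before encoding, so the rest of the assembler only ever sees plain register numbers and real instructions.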

9.2 Simulator

In addition to the assembler, a tool for verifying the correctness of kernels before they are run on the GPU has been developed. It is based on the same parsing backend as the assembler, and simulates the execution of kernels instruction by instruction. When the set of threads has been simulated, the resulting framebuffer is rendered to screen. This tool has proven very useful when developing kernels, as it reduces the need to re-assemble and upload kernels to the GPU just to verify the correctness of the assembly code. The simulator is part of the same project as the assembler described in section 9.1.
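The core idea — run the kernel once per thread, then collect the stores into a framebuffer — can be sketched in a few lines. The instruction handling below is heavily simplified and does not mirror the real Demolicious ISA or the real simulator:

```python
# Toy sketch of the simulator idea: execute a kernel once per thread and
# render the resulting stores into a framebuffer.
WIDTH, HEIGHT = 8, 4

def run_kernel(kernel, thread_id, constants):
    """Execute one thread; return (address, data) of its store, if any."""
    regs = {"$id_lo": thread_id, "$data": 0, "$address_lo": 0}
    store = None
    for instr in kernel:
        op, *args = instr
        if op == "ldc":                   # load constant into register
            regs[args[0]] = constants[args[1]]
        elif op == "mv":                  # register-to-register move
            regs[args[0]] = regs[args[1]]
        elif op == "sw":                  # store word to the framebuffer
            store = (regs["$address_lo"], regs["$data"])
    return store

fillscreen = [("ldc", "$data", 0), ("mv", "$address_lo", "$id_lo"), ("sw",)]
framebuffer = [0] * (WIDTH * HEIGHT)
for tid in range(WIDTH * HEIGHT):
    addr, data = run_kernel(fillscreen, tid, constants=[0x07E0])
    framebuffer[addr] = data

print(all(px == 0x07E0 for px in framebuffer))   # True: screen filled green
```

Because every thread is independent, simulation order does not matter, which is what makes this per-thread interpretation a faithful model of the SIMT execution.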

Part III

Results & Discussion

Chapter 10

Testing

10.1 GPU Component testing

The GPU consists of a number of fairly isolated components. It is valuable to test these components in isolation before they are connected, as both implementation errors and design flaws can then be uncovered before the components are introduced to the system. The unit tests were implemented as VHDL test benches. Table 10.1.1 describes the unit tests that were run and summarizes which features were tested.

46 CHAPTER 10. TESTING

COMPONENT                 WHAT                                             STATUS

Barrel selector           Barrel row is incremented each clock cycle.      Passed
                          Barrel row wraps around to 0.
                          PC write enable high on row 0.

Constant memory           Constants can be written.                        Passed
                          Constants can be read.

Inst decode/Control unit  Decodes instruction types correctly.             Passed
                          Sets control signal values correctly.

Instruction memory        Instructions can be written.                     Passed
                          Instructions are read from the correct address.

Register file             Read/write to general purpose registers.         Passed
                          Dedicated registers behave correctly.

Register directory        Does the masking bit work?                       Passed
                          Multiplexes input/output signals to the
                          correct register file.

Processor core            Arithmetic operations.                           Passed
                          Can mask instructions.

ALU                       Computes arithmetic operations.                  Passed
                          Can do left/right shifts.
                          Performs set if less than correctly.

Table 10.1.1: Unit tests for components in the Demolicious system.

10.2 VHDL system integration tests

Before deploying to an actual FPGA, it is important to ensure correct behavior in system-level testbenches. This testing is valuable because, if correct behavior can be verified in simulation, there are fewer potential errors left when debugging on the FPGA itself. System tests for each of the major paths through the design have been created and run successfully. The result of each test is verified by comparing all pixels in the framebuffer, after the kernel has been run, to precomputed values in ISim.

10.2.1 Memory Stores and Kernel Parameterization

To verify that stores to memory, as well as constant memory, actually work, the kernel presented in listing 5.3 is used as a system test. It is re-listed in listing 10.1 for convenience.

ldc $data, 0
mv $address_lo, $id_lo
mv $address_hi, $id_hi
sw
thread_finished

Listing 10.1: Kernel to test constant memory and parameterization

Expected behavior of test:
1. The color green should successfully be loaded from constant memory.
2. It should be stored to memory.
3. The screen should be filled with the color green.

Simulation Results

(a) Isim simulation showing constant load and store word

(b) LX16 run

Figure 10.2.1: Results from simulation and LX16 of fillscreen kernel


In the left yellow square of figure 10.2.1a, one can see the load constant instruction (0x08050000) being executed in barrel line 1. The Constant to reg signal is asserted, and the constant value 0x07E0 is passed onto the register write data signal. In the right yellow square, the store word instruction (0x10000000) executes in barrel line 0. The LSU accepts the write request in the same cycle (LSU write request goes high), and two cycles later the request packet reaches the LSU data line. The LSU asserts LSU write_n (the signal is active low), and external RAM handles the store request. The testbench passes, and values have now been successfully written to memory. The kernel also runs on actual hardware, with the result shown in figure 10.2.1b.

10.2.2 Predicated Instruction Execution

Predicated instructions are used to allow for some degree of conditional execution in the absence of proper branches and jumps. This requires that the architecture actually respects the mask bit when it is set. The predicated execution kernel presented earlier in listing 5.5 is used for this test. It is re-listed in listing 10.2 for convenience.

ldc $10, 0              ; Load color one
ldc $11, 1              ; Load color two
srl $mask, $id_lo, 6    ; Shift to the right converts ID to y pos
mv $data, $10
? mv $data, $11         ; Will only be executed every other row
mv $address_lo, $id_lo
mv $address_hi, $id_hi
sw
thread_finished

Listing 10.2: Conditional execution using predicated instructions

Expected behavior of test:
1. Each row should be colored according to the last bit of its y position.


Simulation Results

(a) Isim simulation showing successful masking.

(b) LX16 run

Figure 10.2.2: Results from simulation and LX16 of predicated execution

In the left yellow square of figure 10.2.2a, we can see the srl instruction being executed (0x00023181). As this thread has thread id 192, the shift result is 3. The low bit is stored into the mask register, enabling masking for this thread. In the middle yellow square, barrel 0 is once again active, and we can see that the predicate bit of core 0 has been asserted. As this instruction is not masked, the predicate bit is ignored and the value 001f is stored into the data register. In the right yellow square, the conditional data move is executed (0x81602804). As the mask enable signal goes high, the register write enable signal is pulled low due to the predicate bit, so the data is not written to the registers. The testbench passes, and predicated instructions are not executed when masking is enabled. As can be seen in figure 10.2.2b, the kernel runs on actual hardware, and every second line is colored differently.

10.2.3 Loads from Primary Memory

If one wishes to support multi-pass kernels, that is, kernels that use the results of previous kernels to compute their own, then results have to be stored to and loaded from somewhere persistent. Stores to main memory have already been confirmed working, so load instructions are next up for testing. With functioning loads, it is fairly straightforward to implement programs like Conway's Game of Life [1] on the Demolicious system.
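Game of Life is a good illustration of the multi-pass principle: each pass reads the previous framebuffer and writes a new one. The sketch below is a plain Python stand-in for the per-pixel kernels, not Demolicious assembly:

```python
# Multi-pass rendering sketch: each pass reads the previous buffer and
# writes the next one, which is exactly what Conway's Game of Life needs.
def life_step(grid):
    """One Game of Life pass over a 2D grid of 0/1 cells (edges stay dead)."""
    h, w = len(grid), len(grid[0])
    nxt = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            n = sum(grid[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
            nxt[y][x] = 1 if n == 3 or (grid[y][x] and n == 2) else 0
    return nxt

# A blinker oscillates with period 2 between the two buffers.
front = [[0] * 5 for _ in range(5)]
front[2][1] = front[2][2] = front[2][3] = 1   # horizontal blinker
back = life_step(front)                       # pass 1: becomes vertical
print([back[1][2], back[2][2], back[3][2]])   # [1, 1, 1]
```

Swapping `front` and `back` between passes mirrors the double-buffered framebuffer flips described later in this report.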


The fillscreen kernel from earlier is modified to load its value from main memory, and store it back to the framebuffer. It is presented in listing 10.3.

mv $address_hi, $0      ; Load some color from main memory
mv $address_lo, $0
lw
mv $address_hi, $id_hi
ldi $8, 2
add $address_lo, $8, $id_lo
sw                      ; Store data to ID + 2, to avoid overwriting address 0
nop
thread_finished

Listing 10.3: Kernel to test loads from main memory

Expected behavior of test:
1. The loaded color should be stored to the entire framebuffer.

Simulation Results

(a) Isim simulation showing working loads from memory

(b) LX16 run

Figure 10.2.3: Results from simulation and LX16 of load kernel

Figure 10.2.3a shows a barrel of height 2 performing successful loads. In the first cycle of the yellow dotted square, barrel row 0 executes a load word instruction (0x20000000). LSU read request is asserted in the control unit, and routed to the LSU. Two cycles later the request has passed through the LSU, and SRAM responds with the value 55. In the last cycle of the yellow square, the LSU asynchronously writes the result back to the registers, storing the values on the LSU data write-back line to barrel row 0. In the blue dotted square, the same procedure is executed again, this time for the second barrel row. Notice that register_file_select is now set to 1, writing the result back to the second barrel row. The testbench passes; multi-pass kernels can now successfully be implemented.

There is, however, a discrepancy between the simulation and the actual implementation, resulting in the LSU dropping some write-backs, as shown in figure 10.2.3b. Both the RAM used for simulation and the SRAM located on the PCB are combinational. On the development kit, however, SRAM had to be mapped onto block RAM due to a lack of space. This forces the RAM to be clock driven, which does not allow the same-cycle memory responses required, and therefore some write-backs are dropped.

10.3 Verification of Communication Channels

10.3.1 JTAG test

In these tests we attempted to connect to the FPGA and the MCU, respectively, by way of the appropriate headers, thereby verifying that they had been properly soldered in place and were functional.

WHAT                              HOW                                    STATUS

Verify FPGA JTAG's ability        Flash code on FPGA that makes a        Passed
to flash the FPGA                 LED blink

Verify ARM debug interface's      Flash code on MCU to make a            Passed
ability to flash the MCU          LED blink

Table 10.3.1: JTAG tests


10.3.2 EBI Bus

WHAT                        HOW                                            STATUS

EBI bus configuration       Configure the MCU and its GPIO pins and        Passed
on MCU                      write to memory mapped addresses for EBI.
                            Inspect signals with a logic analyzer.

EBI receiver module         Connect MCU and FPGA with EBI, and use the     Passed
on FPGA                     MCU to write to and read from FPGA block RAM.

Table 10.3.2: EBI bus tests

10.3.3 HDMI Output

WHAT                        HOW                                            STATUS

Writing to screen from      Connect the wires of a DVI cable to GPIO       Passed
FPGA using HDMI             pins on the FPGA. Write color values
                            according to the HDMI standard.

Table 10.3.3: HDMI tests

10.3.4 SRAM Communication

WHAT                        HOW                                            STATUS

Write to and read from      Connect the working EBI protocol of the MCU    Passed
SRAM over EBI               to an SRAM chip. Write words to SRAM and
                            read them back.

FPGA to SRAM                Connect FPGA GPIO pins to SRAM. Use the        Unfinished
communication over EBI      FPGA to write words to SRAM and read
                            them back.

Table 10.3.4: SRAM tests

10.4 PCB tests

After receiving the finished PCB and the components, a series of tests were performed to make sure all steps of assembling the finished computer were done correctly.


10.4.1 Solder, signal and power test

The purpose of these tests was to solder the components and check that their connections held, as well as to check that power propagated correctly through the board.

WHAT                          HOW                                           STATUS

5 V power from USB            Solder components, measure voltage on         Passed
connector                     headers on the 5 V power line, verify that
                              it is 5 V

3.3 V power from regulator    Solder components, measure voltage on the     Passed
and VDD headers               3.3 V power plane, verify that it is 3.3 V

1.2 V power from regulator    Solder components, measure voltage from the   Passed
and power indicator LED       1.2 V regulator and power indicator LED,
                              verify that it is 1.2 V

MCU and FPGA soldering        Place MCU and FPGA on the PCB and bake them   Passed
                              in the oven, make sure the soldering looks
                              good and the balls have melted into their
                              sockets

MCU and FPGA connected        Solder helping headers, connect FPGA and      Passed
correctly                     MCU to power, verify correct soldering by
                              flashing the MCU and FPGA

Table 10.4.1: Solder plan and verification

10.4.2 Oscillator and clock test

These tests consisted of using an oscilloscope to probe the outputs of the FPGA's oscillator and of the high frequency and low frequency crystals of the MCU.

WHAT                        HOW                                            STATUS

FPGA oscillator wave        Use oscilloscope on the out pin of the         Passed
frequency at 120 MHz        FPGA oscillator

MCU low frequency           Use oscilloscope to measure the low            Incomprehensible
crystal                     frequency crystal clock at 32.768 kHz          results

MCU high frequency          Use oscilloscope to measure the high           Incomprehensible
crystal                     frequency crystal clock at 48 MHz              results

Table 10.4.2: Frequency output tests

Chapter 11

Results

Figure 11.0.1: The working Demolicious devkit setup

In this section, the fruit of our labor, the Demolicious system, is presented. The system has been successfully run on a concoction of FPGA and MCU devkits, using the HDMI port of the PCB for video output. This setup is showcased in figure 11.0.1. A PCB version is almost fully operational: a version with both FPGA and MCU flashed with their respective images managed to output animations, with some undiagnosed showstopping software glitches. That venture had to be paused due to lack of time towards the end of the project. Kernel images presented in this section are taken from the setup shown in figure 11.0.1.

55 CHAPTER 11. RESULTS

11.1 Scalability of the Demolicious system

The architecture of the Demolicious system scales easily to a larger number of processors. As there is a linear relationship between the number of processor cores on chip and processor throughput, one can simply add more cores to increase performance. Limiting factors to this scalability include:
1. The more cores, the lower the clock frequency, as signal propagation time and fanout increase
2. The space available on the chosen FPGA
3. Power consumption constraints, as more cores increase active power consumption

Cores   Crit. path   Max freq.    Dynamic + quiescent power   Fits LX16   Fits LX45

2       17.103 ns    58.469 MHz   0.272 W: 0.088 + 0.184      Yes         Yes
4       17.959 ns    55.682 MHz   0.292 W: 0.107 + 0.184      Yes         Yes
8       19.722 ns    50.108 MHz   0.353 W: 0.168 + 0.186      No          Yes
16      -            -            -                           No          No

Table 11.1.1: Hardware configurations compared. Harvested from post place & route static simulation.

The 4-core design fits with room to spare on the LX16, but the 8-core design does not fit. The LX45 shifts this up one notch, fitting the 8-core design, but being unable to place & route the 16-core design. For all processor designs, the critical path passes from the immediate field of the instruction through the ALU into the active register file. This path could be shortened drastically by pipelining the processor. Table 11.1.1 shows that there is only a modest drop in maximum frequency from two to eight cores. At 50.108 MHz with 8 cores, the GPU has an instruction throughput of 400 MIPS. This compares favorably to the 117 MIPS of the two-core configuration. It does, however, come with a roughly 80 mW increase in power draw. Luckily this only constitutes a 30% total increase in power, considerably less than the fourfold improvement in GPU throughput.
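These throughput and power figures can be recomputed directly from the table, assuming one instruction completes per core per cycle:

```python
# Recompute the throughput and power deltas from Table 11.1.1, assuming
# one instruction completes per core per cycle.
configs = {2: (58.469e6, 0.272),   # cores: (max frequency in Hz, total W)
           8: (50.108e6, 0.353)}

def mips(cores: int) -> float:
    """Aggregate instruction throughput in MIPS for a configuration."""
    freq_hz, _ = configs[cores]
    return cores * freq_hz / 1e6

print(round(mips(8)))    # 401 -- the ~400 MIPS quoted in the text
print(round(mips(2)))    # 117
extra_w = configs[8][1] - configs[2][1]
print(round(100 * extra_w / configs[2][1]))   # 30 -- percent extra power
```

The fourfold core count yields a 3.4x throughput gain (the clock drops slightly) for only 30% more power.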

11.2 Performance

In the Demolicious system, kernel calls can be issued by the CPU in only a few cycles. This means that the number of frames per second (FPS) that the system can display is dominated by the run time of the kernels.

(a) Output from the green screen kernel. (b) Output from the tunnel kernel.

Figure 11.2.1: Running two example kernels.

Figure 11.2.1 shows the output from the green screen kernel, and from a more complex tunnel effect kernel (listing C in the appendix). It is desirable that both of these kernels can run at about 30 FPS. Using the results presented in section 11.1, the expected frames per second at varying resolutions can be estimated.

[Plot: frames per second (log scale, 10 to 10,000) against number of cores (2, 4, 8), for resolutions 64 × 64 px, 512 × 256 px and 1920 × 1080 px]

Figure 11.2.2: Running the green screen kernel.


[Plot: frames per second (log scale, 1 to 1,000) against number of cores (2, 4, 8), for resolutions 64 × 64 px, 512 × 256 px and 1920 × 1080 px]

Figure 11.2.3: Running the tunnel kernel.

Figures 11.2.2 and 11.2.3 display the relationship between frame rate, resolution and number of cores. Both figures show that doubling the number of cores roughly doubles the frame rate. For a given configuration of cores, the time it takes to process one pixel is constant. This means that the time to execute one kernel scales linearly with the number of pixels. When both the horizontal and vertical resolution are increased, the number of pixels to process grows quadratically, and as a consequence the frame rate decreases quadratically as the resolution grows. For the target resolution of the project, 512 × 256, the project goal of maintaining 30 FPS is achieved.
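These two scaling laws can be captured in a first-order model, calibrated on the single data point stated elsewhere in this report (tunnel kernel, 8 cores, 512 × 256 at 50 FPS). The model is our own simplification of the measured curves:

```python
# First-order FPS model behind the scaling figures: FPS grows linearly
# with core count and shrinks linearly with pixel count. Calibrated on
# the stated point: tunnel kernel, 8 cores, 512 x 256 at 50 FPS.
BASE_CORES, BASE_PIXELS, BASE_FPS = 8, 512 * 256, 50.0

def estimated_fps(cores: int, width: int, height: int) -> float:
    """Estimate frame rate for a core count and output resolution."""
    return BASE_FPS * (cores / BASE_CORES) * (BASE_PIXELS / (width * height))

print(estimated_fps(4, 512, 256))    # 25.0: halving cores halves the FPS
print(estimated_fps(8, 1024, 512))   # 12.5: 4x the pixels, a quarter the FPS
```

Doubling both screen dimensions quadruples the pixel count, which is why the model (and the measured curves) show a quadratic frame-rate drop with resolution.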

11.3 Video output

The Demolicious system can output to a screen using HDMI. The minimum resolution permitted by the HDMI protocol is 640 × 480, but the size of the data memory limits the actual rendered resolution to 512 × 256 pixels. The rest of the screen is padded with a checkerboard pattern. Most of the time the output image is correct; however, it is distorted under certain conditions.


Figure 11.3.1: Flickering when running the tunnel kernel. This picture was exposed over two frames.

Some kernels exhibit intermittent flickering (figure 11.3.1). The exact reason why this occurs is unclear, but it can be observed that the flicker contains parts of the previous frame. This may be caused by a failure in the synchronization mechanism in the video unit. Since the GPU has priority on memory access, the video unit may be starved for data. When the video unit is starved, the buffer containing pixels to output will underflow. For the duration of the starvation, a line containing the previous pixel on the bus is displayed on the screen.

11.4 Single vs double buffering

Screen tearing is a visual artifact where parts of two consecutive frames are displayed at the same time. It occurs because the video unit reads the frame buffer before the GPU has finished rendering to it. The artifact can be observed in figure 11.4.1; occurrences are marked with red circles.


Figure 11.4.1: Single buffering.

Double buffering is a technique to remove this artifact. As the name implies, two independent frame buffers are used. While one frame buffer is being read and displayed on screen, the next frame is rendered to an off-screen frame buffer. Once the frame has finished rendering, the frame buffers are swapped and the new frame is displayed by the video unit. Figure 11.4.2 shows that double buffering improves the quality of the image substantially. The image in the figure still has some artifacts, but the ones caused by single buffering are no longer present.

Figure 11.4.2: Double buffering.

11.5 Demolicious Power efficiency

A Demolicious setup with 8 cores (table 11.1.1) has a power draw of 0.353 W. By quickly and efficiently executing the kernel at hand, the GPU can reduce its total static power consumption. Reducing the dynamic consumption, however, is more difficult. The current GPU architecture executes nops when no kernels are active, keeping dynamic power consumption slightly lower than average, as no memory requests need to be served. Demolicious has therefore been designed with the goal (4.2.1) of allowing the host system to sleep as much as possible.


The Giant Gecko microcontroller used as the host device has four different energy modes; each lower energy mode is more efficient, but also more constrained. Deeper power levels conserve more power, but require longer wake-up times. EM0 is normal operation, while EM2 (deep sleep mode) is the deepest power level with acceptable wake-up times for our use case (around 2 ms) [11]. There are two different sleep patterns used by the MCU, depending on the executing program:
1. A visual animation or similar, rendering at a known target FPS. The MCU can stay in energy mode EM2 while kernels are executing, waking at fixed intervals to update and relaunch kernels.
2. Programs working on some large dataset, where it is beneficial to pack kernel execution tightly, reducing GPU idling. The host device can enter EM2 after launching a kernel, using the kernel complete pin to trigger an interrupt that wakes the host device from sleep.
The design is sound, and has been implemented successfully before by the authors, but there was not enough time to fully implement it in the host code for this project. The following numbers are therefore harvested from the Silicon Labs Simplicity Studio energyAware Battery estimator. Demolicious with 8 execution cores can run the tunnel kernel at 50 FPS @ 512 × 256, spending 12.5 ms per frame (figure 11.2.3). The tunnel kernel is to be executed 30 times per second, with the MCU waking via a timer interrupt to spawn the new kernels. Kernel launch overhead is roughly 10 µs + 2 ms ≈ 2 ms. When waking 30 times per second, this results in 60 ms awake per second. EM0 with EBI consumes 142.6 mA, EM2 190 µA. This averages to roughly 142.6 mA ∗ 0.060 + 0.19 mA ∗ 0.940 = 8.734 mA. For the large dataset kernel, assume it takes 100 ms to complete execution. The MCU then has to wake 10 times per second, using 2 ms to wake and a negligible overhead to deploy new kernels. This results in 10 ∗ 2 ms = 20 ms spent in EM0, and 980 ms spent in EM2. This averages to 142.6 mA ∗ 0.020 + 0.19 mA ∗ 0.980 = 3.038 mA. Without any use of sleep mode for the MCU, the 142.6 mA drawn makes for a total power consumption of 142.6 mA ∗ 3.3 V = 470.6 mW. For the tunnel kernel example, however, making use of sleep mode puts the power consumption at 8.734 mA ∗ 3.3 V = 28.8 mW. For the large dataset kernel, power consumption is at 3.038 mA ∗ 3.3 V = 10.0 mW. These results show that there is a clear gain to be had from introducing low power modes.
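The duty-cycle arithmetic above is easy to check in a few lines, using the quoted battery-estimator currents:

```python
# Check the MCU duty-cycle estimates: EM0 with EBI draws 142.6 mA, EM2
# draws 0.19 mA, and the supply is 3.3 V.
EM0_MA, EM2_MA, VDD = 142.6, 0.19, 3.3

def avg_current_ma(awake_fraction: float) -> float:
    """Average current when awake for the given fraction of each second."""
    return EM0_MA * awake_fraction + EM2_MA * (1.0 - awake_fraction)

tunnel = avg_current_ma(0.060)    # 30 wakeups/s x 2 ms awake
dataset = avg_current_ma(0.020)   # 10 wakeups/s x 2 ms awake
print(f"{tunnel:.2f} mA, {tunnel * VDD:.1f} mW")    # 8.73 mA, 28.8 mW
print(f"{dataset:.2f} mA, {dataset * VDD:.1f} mW")  # 3.04 mA, 10.0 mW
```

The savings come almost entirely from the awake fraction: at a 3-in-1000 EM2/EM0 current ratio, time spent in EM0 dominates the average.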

Power budget

The computer is powered by an EH-70P USB charger for the Nikon Coolpix S2700 camera [12, p. 196]. This charger outputs 5 V and delivers up to 550 mA, giving the power input an upper bound on power consumption of 550 mA ∗ 5 V = 2.75 W.


According to the memory datasheet [6, p. 4], the average current drawn by the SRAM is 100 mA. With two chips, the memory energy consumption becomes 2 ∗ 100 mA ∗ 3.3 V = 660 mW, which makes the memories the most power hungry part of the computer. Summing these major components and their estimated power consumption, the final system draws approximately 660 mW + 353 mW + 28 mW = 1041 mW. The total power consumed is likely a little higher on account of signal propagation, but it is still significantly less than the 2.75 W we accounted for in the power design.
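The budget check can be written out explicitly:

```python
# Power budget: the USB charger supplies 5 V at 550 mA; compare this with
# the estimated draw of the major components.
budget_mw = 550 * 5                 # 2750 mW available
sram_mw = 2 * 100 * 3.3             # two SRAM chips at 100 mA on 3.3 V
fpga_mw = 353                       # 8-core FPGA design, table 11.1.1
mcu_mw = 28                         # MCU with sleep modes, section 11.5
total_mw = sram_mw + fpga_mw + mcu_mw
print(round(total_mw), total_mw < budget_mw)   # 1041 True
```

The estimate leaves more than 1.7 W of headroom, which comfortably covers losses the component figures do not capture.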

Chapter 12

Discussion

Hindsight is 20/20. Looking back at the development of Demolicious, some design decisions have stood the test of time better than others. A project of this size with such a short timeframe requires quick decision making, sometimes sacrificing the optimal solution for a working one. For example, an extreme focus on redundancy in the design of the PCB has resulted in almost all components being usable, at the cost of some power efficiency. This chapter presents controversial design decisions, better solutions where available, and workarounds.

12.1 Energy Efficiency

12.1.1 Optimizing FPGA power usage

While it was relatively easy to save power on the CPU side of things, the FPGA posed a bigger challenge. The CPU easily enters deeper sleep modes, but nothing similar exists by default for the Spartan-6 FPGA. One way to reduce FPGA power consumption is to clock gate components, in effect turning them off when they are not needed. It turns out that Xilinx supports dynamic clock gating for the Spartan-6 series used in this project [16]. This feature can reduce power consumption by up to 30% for components opting to use chip enable signals. The presented architecture makes some use of chip select signals, but does not shut down the execution cores, instruction fetch or the LSU when idling. Adding such a system-wide chip enable signal would reduce power usage during GPU idle time, a feature that the next iteration of Demolicious should include. It was not implemented for this project, mainly due to time constraints.

Now for a quick calculation of how much power could be saved in an absolute best-case scenario. Demolicious with 8 execution cores can run the tunnel kernel at 50 FPS @ 512 × 256 (as seen in figure 11.2.3), spending 12.5 ms per frame. When targeting 30 FPS, the GPU would spend 375 ms per second calculating frames, allowing it to be in a low power state for 62.5% of the time. Using a theoretical maximum of 30% dynamic power savings in the low power state, the dynamic power consumption of 0.168 W from table 11.1.1 can be reduced to 0.70 ∗ 0.625 ∗ 0.168 W + 0.375 ∗ 0.168 W = 0.1365 W. This represents a saving in dynamic power consumption of 18.7%. The total FPGA power draw would be reduced to 0.1365 W + 0.1860 W = 0.3225 W, an 8.6% reduction over the original 0.3530 W. This shows that FPGA power consumption is still dominated by the quiescent draw.
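The best-case estimate above reduces to a few multiplications:

```python
# Best-case clock gating estimate for the 8-core design: 30% dynamic
# savings while idle, idle 62.5% of the time at a 30 FPS target.
DYNAMIC_W, QUIESCENT_W = 0.168, 0.186    # from table 11.1.1
active = 0.375                           # 30 frames/s x 12.5 ms/frame
idle = 1.0 - active
gated_dynamic = 0.70 * idle * DYNAMIC_W + active * DYNAMIC_W
total = gated_dynamic + QUIESCENT_W
print(f"{gated_dynamic:.4f} W")                          # 0.1365 W dynamic
print(f"{100 * (1 - gated_dynamic / DYNAMIC_W):.1f} %")  # dynamic savings
print(f"{total:.4f} W")                                  # 0.3225 W total
```

Even in this best case, the quiescent 0.186 W is untouched, which is why the overall saving stays below 9%.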

12.1.2 Physical Implementation

To make a PCB as energy efficient as possible, one relies on good practices: the wires should be as short as possible and the board as small as possible, which keeps static power consumption low, and short signal wires keep dynamic power consumption low as well. The backup oriented design of the PCB does not fit well with this. The number of headers makes wires unnecessarily long, as well as making the board itself quite large. A header needs a hole through the whole board, which makes it impossible to route wires through that spot. Since the PCB has six signal layers, each header removes the possibility of six wires passing through where the header sits. This makes the static power consumption higher than it could be. By putting the entire bus on headers, the wires between the CPU and the GPU are longer than they could have been, which raises the dynamic power consumption. All of this is a trade-off, however, as there was only one chance for the PCB to work.

12.2 Programming Challenges

There are several limitations and issues a programmer must keep in mind when developing programs for the Demolicious system. First of all, since no compiler for a high-level language is available, kernel code must be written in assembly. The limitations of the language and the relatively low expressiveness it provides make kernels verbose and require substantial mental overhead. The implementation of conditional execution through predicated instructions only can also be quite difficult to get accustomed to. Chaining multiple conditions together to form a correct logical structure can require sketching and some upfront planning of the program flow, even for simple kernels. These are, however, problems that are quite easily fixed by allowing compilation from a language with a higher abstraction level. A tougher limitation, although also one that can be handled by a compiler, is the limit on the frequency of memory instructions. Depending on the number of GPU cores, memory instructions cannot be issued more often than every X cycles. Dealing with this without the performance drop of executing nop instructions is important when rendering live graphics.

64 CHAPTER 12. DISCUSSION

12.3 SRAM - Alternative Memory Layouts

The presented architecture uses interleaved SRAM banks, presenting one large unified memory space. An alternative approach is to dedicate one SRAM bank to the LSU and the other to the HDMI unit. This approach works when using double buffering, as the LSU and HDMI unit can swap banks on each framebuffer flip. With dedicated banks there is no need for arbitration, as there will never be more than one unit accessing each bank at a time. This means the HDMI unit would never starve due to memory lockout by the LSU. The LSU would also be simpler, needing only a single queue to service memory requests. The major advantage of interleaving the address space of the two memory banks is that memory throughput effectively doubles: two requests can be processed in parallel from the unit currently granted access by the SRAM arbiter. When a warp of size n requests memory access, it takes n/2 cycles to dispatch all requests, as each SRAM bank services half of them. Therefore, the barrel height only has to equal half the number of processor cores. With dedicated memory banks, however, only one memory request can be serviced each cycle, which requires the barrel height to increase to n. As the register directories make up a significant part of the design footprint, the barrel height reduction is significant. Signal fanout to and from the register banks is also reduced, dampening the drop in clock frequency when scaling the number of computation cores. More hardware resources can thus be allocated to processor cores, power is saved because fewer registers are used, and the clock frequency stays higher. The drawback is the occasional flicker due to HDMI unit starvation.
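The n/2-cycle dispatch argument can be illustrated with a toy model. Selecting the bank by the low address bit is an assumption made for illustration; the report does not specify the exact interleaving scheme:

```python
# Toy model of interleaved-bank dispatch: with two SRAM banks selected
# (here, by the low address bit), each cycle can issue one request per
# bank, so a warp of n requests drains in about n/2 cycles.
def cycles_to_dispatch(addresses, banks=2):
    """Count cycles to issue all requests, one per bank per cycle."""
    queues = [[a for a in addresses if a % banks == b] for b in range(banks)]
    return max(len(q) for q in queues)

warp = list(range(100, 108))               # 8 threads, consecutive addresses
print(cycles_to_dispatch(warp))            # 4 cycles: n/2 for n = 8
print(cycles_to_dispatch(warp, banks=1))   # 8 cycles with a dedicated bank
```

Consecutive framebuffer addresses split evenly between the banks, which is the case the n/2 figure assumes; a pathological all-even access pattern would still take n cycles.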

12.4 Multiple clock domains in HDMI unit

The HDMI unit needs to output words at a frequency of 25 MHz. If it were to share a clock with the rest of the system, the system clock would have to be 25 MHz or a multiple thereof. To avoid imposing this limitation, the buffer in the HDMI unit crosses clock domains. Since the buffer receives data at native memory speed, it is filled faster than the HDMI unit can drain it. The data read out at 25 MHz also needs to be serialized up to a frequency of 250 MHz before being output from the FPGA. This is, to the knowledge of the authors, the first project in the Computer Design Project course with a design incorporating multiple clocks.
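The 25 MHz to 250 MHz ratio follows directly from TMDS encoding, which expands each 8-bit color channel byte into a 10-bit symbol:

```python
# Why the serializer runs at 250 MHz: TMDS encodes each 8-bit channel
# byte into a 10-bit symbol, so the per-channel bit clock must be 10x
# the 25 MHz pixel clock.
PIXEL_CLOCK_MHZ = 25
BITS_PER_TMDS_SYMBOL = 10
serial_clock_mhz = PIXEL_CLOCK_MHZ * BITS_PER_TMDS_SYMBOL
print(serial_clock_mhz)   # 250
```

This gives the design three clock domains in the video path alone: the system clock filling the buffer, the 25 MHz pixel clock draining it, and the 250 MHz serializer clock.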

12.5 Troubleshooting and Workarounds

During testing, it became apparent that two major issues with the PCB needed to be fixed. Firstly, the FPGA pin used for the clock was not one of the dedicated clock pins. Secondly, the clock component had not been implemented according to spec.


FPGA clock pin problem

An error in the pin connection between the FPGA and the clock was discovered during testing. The FPGA requires a dedicated pin for the clock, but the PCB routed the clock to a regular GPIO pin. After checking the PCB layout, a pin on the EBI bus was found to be usable as a clock input for the FPGA. Since the clock is connected to the FPGA by jumpers, the pin on the EBI bus could be wired to the clock jumper on the FPGA, solving the problem.

Wrong implementation of the clock

Due to a poor schematic in the clock component's datasheet, the clock was wired incorrectly on the PCB, as illustrated in figure 12.5.1.

Figure 12.5.1: On the left: How the clock from the FPGA should be connected. On the right: how it was connected on the PCB.

Because of this, the clock did not work. The problem was solved by soldering together the pads of the SMD (surface mounted device) capacitor to let the clock signal flow freely to the header pin. A through-hole version of the removed capacitor was placed on one of the solder pads, with its other end grounded, see figure 12.5.2. This made the circuit on the PCB match the correct circuit from the datasheet.


Figure 12.5.2: How the FPGA clock fix was implemented.

Mismatch LDO

During the testing phase, the LDO for the 1.2 V rail did not output the correct voltage. The 1.2 V LDO is the same model as the 3.3 V LDO, but has a different footprint, see figure 12.5.3. It was assumed that both components had the same footprint, which resulted in a mismatch between the footprint and the component. The solution was to place the LDO diagonally and manually solder a wire to the correct solder pad on the PCB.


Figure 12.5.3: On the left: 3.3 V LDO footprint. On the right: 1.2 V LDO footprint.

12.6 Hardware components

In this section we discuss some of the physical design choices, alternative solutions, and the practical workarounds that arose during the physical implementation.

USB data input

In the initial design, a USB connector was included to receive data from a host PC. During the implementation phase, however, it turned out to be unnecessary: the software applications we wanted to run on the computer could be included when programming the microcontroller over JTAG. This was sufficient for our needs. Since there was still much else to do at the time, implementing input over USB became a very low priority and in the end never happened. The same applies to the serial backup solution.

VGA port

The PCB was designed with two VGA headers, one for the FPGA and one for the MCU. An alternative solution is a single VGA port with exposed headers. With the MCU and FPGA signals also exposed on headers, jumpers could select which of the two the VGA port is connected to, as shown in figure 12.6.1. By removing one of the VGA ports, the size of the PCB could have been reduced, or the space used for other components.


Figure 12.6.1: On the left: The current setup on our PCB. On the right: The alternative solution.

12.6.1 Clocks

Oscillator

As seen in figure 12.6.2, the period of the oscillator, measured with a ruler on the oscilloscope, is around 8.5 ns, which translates to around 118 MHz. The tolerance of the oscillator is up to 100 ppm, see the datasheet [3]. Since measuring with a ruler gives unreliable results, and the nominal frequency lies well within that measurement error, one can assume the oscillator gives the correct output.
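The reasoning above can be made concrete with a small calculation. The 120 MHz nominal frequency is taken from the component order (table A.2.1); the point is that the datasheet's ppm tolerance is orders of magnitude smaller than what a ruler can resolve, so the ~2 MHz discrepancy is attributed to the measurement method.

```python
# Period-to-frequency conversion for the oscillator measurement.
period_ns = 8.5                    # read off the oscilloscope with a ruler
measured_mhz = 1000.0 / period_ns  # ~117.6 MHz

nominal_mhz = 120.0                # 120 MHz oscillator (table A.2.1)
tolerance_ppm = 100                # from the datasheet [3]

# The datasheet tolerance only allows +/- 0.012 MHz of deviation,
# far below the resolution of a ruler measurement.
max_drift_mhz = nominal_mhz * tolerance_ppm / 1e6

assert abs(measured_mhz - 117.647) < 0.01
assert max_drift_mhz < 0.1
```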


Figure 12.6.2: The oscillator on an oscilloscope.

MCU crystals

As can be seen from the frequency tests in table 10.4.2, the crystal output frequency tests did not pass their requirements. The crystals were probed for output, but this resulted in no discernible signal on the oscilloscope. However, as the implementation continued, we found that the MCU worked as desired. This led us to conclude that we had likely misunderstood how to measure the frequencies of the crystals.

12.7 Budget

The group was given a budget of 10 000 NOK for production of the PCB and component purchases. Ten boards were manufactured at a price of 10 103 NOK, and components were ordered at a price of 7 571 NOK (see appendices A.1.1 and A.2.1). This made for a total cost of 17 674 NOK, a cost overrun of 76 %. There were two main reasons for this. Firstly, the PCB design took so long that a fast production time was required. This, along with a very large board, drove the cost up greatly. Secondly, components for at least 5 PCBs were bought. If fewer components had been bought, the price would have been lower. However, this was all part of the backup oriented design philosophy. There are therefore two main ways to spend less money: the total cost could be driven down significantly by a less backup oriented design with fewer components, and if the time limit had been less strict, allowing a longer production time, the boards would have been cheaper.
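The overrun figure follows directly from the order totals in the appendices:

```python
# Budget arithmetic from appendices A.1.1 and A.2.1 (amounts in NOK).
budget = 10_000
pcb_cost = 10_103
component_cost = 7_571

total = pcb_cost + component_cost
overrun_pct = (total - budget) / budget * 100

assert total == 17_674
assert 76 < overrun_pct < 77  # 76.74 %, quoted as 76 % in the report
```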

Chapter 13

Conclusion

We designed and implemented a graphics processing unit inspired system, capable of executing a large number of threads fast enough to display graphics on a display over HDMI. The final system meets all the requirements set at the beginning of the project. The goal of at least 30 FPS has been reached for several kernels at the target screen resolution of 512x256 pixels when running with 8 processor cores. CPU power savings were simulated successfully, showing great promise. A fully functional PCB was created, with each individual component working. The implementation of the Demolicious system on the PCB was almost finalized, with only minor issues remaining. There is still room for further improvement of the computer, which is detailed in the next section. This project has been extremely challenging and demanding. Because of this, the group as a whole has gained great insight into the inner workings of computers and GPUs.

13.1 Further Work

A beautiful thing about projects like this is that they can be done significantly better the second time around. This section focuses on possible further improvements to the system.

13.1.1 Architecture

As a consequence of the barrel processor design, the processor lends itself very well to pipelining. Consecutive instructions in the pipeline will always be from different threads. As long as there are fewer pipeline stages than the height of the barrel, an instruction will complete its path through the pipeline before the next one from the same thread starts. This means that there are no data hazards in the pipeline. Pipelining is thus as simple as dividing the processor into stages and adding registers. Currently, the critical path in Demolicious goes through the processor cores. Adding a pipeline to the processor would therefore increase the maximum frequency the processor can run at. As the SRAMs on the Demolicious board can handle 100 MHz, there is room for improvement in the system clock frequency. This directly translates to higher throughput.
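The hazard-freedom argument can be checked with a small model. This is an illustrative sketch, not the actual scheduler: with round-robin thread issue, the threads occupying the pipeline stages at any cycle are consecutive thread indices modulo the barrel height, so no thread appears twice as long as the stage count does not exceed the barrel height.

```python
def pipeline_occupants(barrel_height, stages, cycle):
    """Which thread occupies each pipeline stage at a given cycle,
    under round-robin (barrel) thread issue."""
    # The instruction in stage s was issued `s` cycles ago.
    return [(cycle - stage) % barrel_height for stage in range(stages)]

# 8 threads in the barrel, a hypothetical 5-stage pipeline:
# at every cycle, all stages hold instructions from distinct threads,
# so no data hazards can occur between in-flight instructions.
for cycle in range(32):
    occupants = pipeline_occupants(barrel_height=8, stages=5, cycle=cycle)
    assert len(set(occupants)) == len(occupants)
```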

13.1.2 Physical design

The physical design of Demolicious worked as intended. A second version would allow for greater energy efficiency and a smaller size. All headers and unnecessary backup solutions could be removed, letting a new design focus on a small PCB with short wires and small power planes. If a more powerful computer is to be made, a more powerful FPGA can be introduced along with a matching number of SRAMs, as a more powerful FPGA will have greater memory needs. Furthermore, power and data can be unified in a single USB connection. The power USB and the FPGA are currently positioned far from one another, but in a new design they can be moved closer so that the 1.2 V wire that powers the FPGA can be made as short as possible. Lastly, it would have been convenient to have flash memory for the FPGA, as this would retain the program after the power has been shut off and eliminate the need to flash the FPGA after each power cycle.

13.1.3 Compiler

A major hurdle to exploiting the full potential of the Demolicious system was the difficulty and lack of user-friendliness of writing programs in Demolicious assembly. To greatly simplify programming, a C compiler targeting Demolicious assembly could be implemented.

Bibliography

[1] John Conway. Conway's Game of Life. http://en.wikipedia.org/wiki/Conway%27s_Game_of_Life.
[2] David A. Patterson and John L. Hennessy. Computer Organization and Design.
[3] Fox Electronics. Fox XpressO Oscillator. http://www.foxonline.com/pdfs/FXO_HC73.pdf.
[4] fpga4fun. HDMI pinout. http://www.fpga4fun.com/HDMI.html.
[5] IDC. PC Market Beats Expectations with Mild Improvement in Business Outlook. http://www.idc.com/getdoc.jsp?containerId=prUS24375913.
[6] Alliance Memory Inc. SRAM datasheet. http://alliancememory.com/pdf/sram/fa/as7c38098a.pdf.
[7] Silicon Laboratories Inc. EFM32GG Design consideration. http://www.silabs.com/Support%20Documents/TechnicalDocs/AN0002.pdf.
[8] Silicon Laboratories Inc. EFM32GG Reference Manual.
[9] Silicon Laboratories Inc. EFM32gg990 datasheet. http://www.silabs.com/Support%20Documents/TechnicalDocs/EFM32GG990.pdf.
[10] Silicon Labs. EFM32 emlib Peripheral Library: EBI. http://devtools.silabs.com/dl/documentation/doxygen/EM_CMSIS_3.20.7_DOC/emlib_gecko/html/group__EBI.html.
[11] Silicon Labs. EFM32GG Reference Manual. http://www.keil.com/dd/docs/datashts/energymicro/efm32gg/d0053_efm32gg_reference_manual.pdf.
[12] Nikon. Coolpix S2700 Reference Manual. http://cdn-10.nikon-cdn.com/pdf/manuals/coolpix/S2700RM_%28En%2903.pdf.
[13] Jon Peddie Research. Add-in board market up in Q3, Nvidia increases market share lead. http://jonpeddie.com/publications/add-in-board-report/.
[14] Jon Peddie Research. GPU shipments Q2 2014 - Charts and Images. http://jonpeddie.com/news/comments/gpu-shipments-marketwatch-q2-2014-charts-and-images/.
[15] Jon Peddie Research. Qualcomm Single Largest Proprietary GPU Supplier. http://jonpeddie.com/press-releases/details/qualcomm-single-largest-proprietary-gpu-supplier-imagination-technologies-t/.
[16] Xilinx. Xilinx ISE Design Suite 12 what's new. http://www.xilinx.com/support/documentation/white_papers/wp370_Intelligent_Clock_Gating.pdf.

List of Tables

4.2.1 Goals set for the Demolicious system...... 14

7.5.1 The registers contained within the register files...... 35

10.1.1 Unit tests for components in the Demolicious system...... 47
10.3.1 JTAG tests...... 52
10.3.2 EBI bus tests...... 53
10.3.3 HDMI tests...... 53
10.3.4 SRAM tests...... 53
10.4.1 Solder plan and verification...... 54
10.4.2 Frequency output tests...... 54

11.1.1 Hardware configurations compared. Harvested from post place & route static simulation...... 56

A.2.1 Component order...... 81

B.1.1 Register overview...... 83
B.3.1 R-type instructions...... 85
B.3.2 Instruction format for R-type instructions...... 85
B.3.3 I-type instructions...... 86
B.3.4 Instruction format for I-type instructions...... 86
B.3.5 Pseudo instructions...... 86

List of Figures

3.3.1 Streaming Multiprocessor...... 11
3.3.2 Warps of 32 threads execute...... 12

5.1.1 Logical overview of the system...... 16


5.2.1 Relationship between CPU and GPU code...... 17

6.2.1 Overview of memory mapped address spaces on the CPU...... 22
6.3.1 ...... 23
6.4.1 Starting a kernel on the GPU...... 23
6.5.1 CPU-GPU interaction when loading a parameter...... 24

7.2.1 A high level overview of the GPU...... 27
7.3.1 Launching a kernel from the GPU's viewpoint...... 27
7.4.1 Execution timing of warps without jagged scheduling...... 29
7.4.2 Execution timing of warps with jagged scheduling...... 30
7.5.1 A single processor core...... 31
7.5.2 Thread spawner and neighboring components...... 32
7.5.3 Thread spawner RTL implementation...... 33
7.5.4 The register directory, and its neighboring components...... 34
7.5.5 Using the mask bit to disable register writes...... 35
7.5.6 RTL of the load/store unit. Notice the dual queueing system to handle the interlaced SRAM...... 36
7.5.7 The logical overview of the video unit. A prefetching unit tries to fill the buffer when it is not full. The Accepted signal signifies a successful memory request. Across the clock domain boundary, the video timing unit generates control signals. When pixel data should be transferred, the Send signal moves pixels from the buffer to the transmission units. The 16-bit pixel data is converted to three 8-bit colors, TMDS encoded, serialized and output using differential signaling. Should the buffer go empty, the prefetch unit needs to immediately advance to the next pixel to stay synchronized with the timing unit...... 37

8.0.1 Completed PCB...... 39
8.1.1 Conceptual overview of the PCB. Green boxes are main solutions. Blue boxes are backup plans. Gray boxes labeled "PINS" mean that these signals are exposed on headers...... 40
8.4.1 Picture of the physical power circuit, where P5 is the header on which an external power source can be connected...... 42

10.2.1 Results from simulation and LX16 of fillscreen kernel...... 48
10.2.2 Results from simulation and LX16 of predicated execution...... 50
10.2.3 Results from simulation and LX16 of load kernel...... 51

11.0.1 The working Demolicious devkit setup...... 55
11.2.1 Running two example kernels...... 57
11.2.2 Running the green screen kernel...... 57
11.2.3 Running the tunnel kernel...... 58
11.3.1 Flickering when running the tunnel kernel. This picture was exposed over two frames...... 59
11.4.1 Single buffering...... 60
11.4.2 Double buffering...... 60

12.5.1 On the left: How the clock from the FPGA should be connected. On the right: how it was connected on the PCB...... 66


12.5.2 How the FPGA clock fix was implemented...... 67
12.5.3 On the left: 3.3 V LDO footprint. On the right: 1.2 V LDO footprint...... 68
12.6.1 On the left: The current setup on our PCB. On the right: The alternative solution...... 69
12.6.2 The oscillator on an oscilloscope...... 70

A.1.1 PCB order...... 80

Listings

3.1 A sequential program filling the screen with green...... 10
3.2 A CUDA kernel filling a single pixel with green...... 10
3.3 Starting the CUDA kernel with one thread per pixel on a 1920x1080 screen...... 10
5.1 A simple kernel that fills the screen with the color green...... 17
5.2 Loading and executing a kernel...... 18
5.3 A kernel loading the color value from a parameter...... 18
5.4 Now drawing a blue screen using parameters...... 18
5.5 Conditional execution using predicated instructions...... 19
5.6 The green-screen kernel as it is actually...... 19
6.1 A load_kernel function call with the fillscreen kernel...... 22
6.2 Running a kernel...... 23
6.3 Setting a kernel parameter...... 24
10.1 Kernel to test constant memory and parameterization...... 48
10.2 Conditional execution using predicated instructions...... 49
10.3 Kernel to test loads from main memory...... 51

Part IV

Appendices

Appendix A

Expenses


A.1 PCB manufacturing

Figure A.1.1: PCB order


A.2 Component purchases

#   Description          Amount  Unit Price (NOK)  Total (NOK)
1   HDMI Receptacle      5       8.20              41.00
2   Memory               10      104.00            1040.00
3   1k Res               100     0.19              19.00
4   Buttons              25      3.35              83.75
5   LEDs                 30      5.30              159.00
6   Inductors            25      1.35              33.75
7   SPD                  5       16.45             82.25
8   LED Resistor         20      1.05              21.00
9   USB Receptacle       10      4.25              42.50
10  1.2V res             25      1.33              33.25
11  Low Freq. Crystal    5       9.75              48.75
12  15 Ohm res           10      1.05              10.50
13  1.2V regulator       5       4.50              22.50
14  3.3V regulator       5       22.70             113.50
15  4.7 kOhm res         10      1.05              10.50
16  ESD protection       5       1.20              6.00
17  Headers              20      9.95              199.00
18  Jumpers              200     1.05              210.00
19  470nF caps           60      3.00              180.00
20  100uF caps           35      5.39              188.65
21  4.7uF caps           40      1.65              66.00
22  100nF caps           100     0.75              75.00
23  10uF caps            10      3.58              35.80
24  1uF caps             10      2.55              25.50
25  12pF caps            10      1.12              11.20
26  22pF caps            10      1.65              16.50
27  15pF caps            10      1.25              12.50
28  10nF caps            10      3.60              36.00
29  FPGA                 10      402.32            4023.20
30  MCU                  10      51.35             513.50
31  48 MHz Crystal       5       14.27             71.30
32  120 MHz Oscillator   5       17.10             85.50
33  Noise Suppressor     5       10.84             54.18

Total cost (NOK) 7571.08

Table A.2.1: Component order

Appendix B

Instruction set architecture

This appendix will provide an overview of the instruction set supported by the GPU.


B.1 Registers

The following registers are available:

Number  Name                    Description                                        R/W         Size
$0      zero                    Always contains the value zero                     Read-only   16
$1, $2  id_hi, id_lo            The current thread's ID                            Read-only   16
$3, $4  address_hi, address_lo  Address used by load & store instructions          Read/Write  16
$5      data                    Data loaded/stored by load & store instructions    Read/Write  16
$6      mask                    Conditional instructions will be masked when this  Read/Write  1
                                register is set to 1 (section B.2)
$7-$15                          General-purpose                                    Read/Write  16

Table B.1.1: Register overview

Special registers may be referenced by their name in assembly code.


B.2 Predicated instructions

The only supported way of conditional execution is through predicated instructions (masking). All instructions except sw have a predicated version, which only takes effect when masking is disabled. In assembly, an instruction prepended with a question mark is executed conditionally. The first bit of the instruction is set to one for the conditional versions. When masking is enabled, these predicated instructions are still executed, but they never store their result in the destination register. Masking is controlled by the dedicated masking register; instructions can write to this register to turn masking on and off. The register is only one bit wide, and therefore keeps only the least significant bit of any data written to it.
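A minimal sketch of these semantics, as a hypothetical Python model rather than the actual hardware: a predicated instruction always executes, but its writeback is suppressed while the one-bit mask register is set.

```python
def execute(regs, dest, value, predicated):
    """Model of writeback under predication (illustrative only)."""
    if predicated and (regs["mask"] & 1):  # mask register keeps only the LSB
        return                             # result computed, never written
    regs[dest] = value

regs = {"mask": 0, "$7": 0}
execute(regs, "$7", 42, predicated=True)   # mask off: write happens
assert regs["$7"] == 42

regs["mask"] = 1
execute(regs, "$7", 99, predicated=True)   # mask on: write suppressed
assert regs["$7"] == 42

execute(regs, "$7", 7, predicated=False)   # unpredicated: always writes
assert regs["$7"] == 7
```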


B.3 Instructions

B.3.1 R-type Instructions

All R-type instructions have opcode 00000. Their syntax, meaning and underlying ALU function are described in table B.3.1.

Instruction Example Meaning ALU Function

Add                     add $1, $2, $3   $1 = $2 + $3             0x4
Subtract                sub $1, $2, $3   $1 = $2 - $3             0x5
Multiply                mul $1, $2, $3   $1 = $2 * $3             0x9
And                     and $1, $2, $3   $1 = $2 & $3             0x6
Or                      or $1, $2, $3    $1 = $2 | $3             0x7
Xor                     xor $1, $2, $3   $1 = $2 ^ $3             0x8
Set on less than        slt $1, $2, $3   $1 = ($2 < $3) ? 1 : 0   0x3
Set on equal            seq $1, $2, $3   $1 = ($2 == $3) ? 1 : 0  0xA
Shift Left Logical      sll $1, $2, 10   $1 = $2 << 10            0x0
Shift Right Logical     srl $1, $2, 10   $1 = $2 >>> 10           0x1
Shift Right Arithmetic  sra $1, $2, 10   $1 = $2 >> 10            0x2

Table B.3.1: R-type instructions

Bits:   1     5       5    5    5    5    1    5
Field:  mask  opcode  rs   rt   rd   sh   X    function

Table B.3.2: Instruction format for R-type instructions.
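The field order and widths from table B.3.2 can be packed into a 32-bit word as sketched below. The packing helper `encode_rtype` is illustrative; the actual assembler's implementation is not shown in this report.

```python
def encode_rtype(mask, opcode, rs, rt, rd, sh, function, x=0):
    """Pack the R-type fields (1+5+5+5+5+5+1+5 = 32 bits), mask first."""
    word = mask                    # 1 bit
    word = (word << 5) | opcode    # 5 bits
    word = (word << 5) | rs        # 5 bits
    word = (word << 5) | rt        # 5 bits
    word = (word << 5) | rd        # 5 bits
    word = (word << 5) | sh        # 5 bits
    word = (word << 1) | x         # 1 bit (unused)
    word = (word << 5) | function  # 5 bits
    return word

# add $1, $2, $3: R-type opcode 00000, ALU function 0x4 (table B.3.1).
word = encode_rtype(mask=0, opcode=0, rs=2, rt=3, rd=1, sh=0, function=0x4)
assert word.bit_length() <= 32
assert word & 0x1F == 0x4  # function field occupies the low 5 bits
```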


B.3.2 I-type Instructions

Instruction Example Meaning Opcode

Load constant    ldc $7, 1        $7 = constant_memory[1]     0x2
Add immediate    addi $7, $7, 2   $7 = $7 + 2                 0x1
Load             lw               $data = memory[$address]    0x8
Store            sw               memory[$address] = $data    0x4
Thread finished  thread_finished  Stops executing the kernel  0x10
Nop              nop              Do nothing                  0x0

Table B.3.3: I-type instructions.

Bits:   1     5       5    5    16
Field:  mask  opcode  rs   rd   immediate

Table B.3.4: Instruction format for I-type instructions.

B.3.3 Pseudo Instructions

Some additional instructions are supported by the assembler in order to make programming for the GPU easier. An overview of these instructions is provided in table B.3.5.

Instruction Example Translated instruction

Move            mv $1, $2   add $1, $0, $2
Load immediate  ldi $1, 1   addi $1, $0, 1

Table B.3.5: Pseudo instructions
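The expansion in table B.3.5 can be sketched as a small rewrite pass. This `expand` function is a hypothetical illustration of what the assembler does, not its actual implementation:

```python
def expand(instr):
    """Expand a pseudo instruction into its real equivalent (table B.3.5)."""
    op, *args = instr.replace(",", "").split()
    if op == "mv":                 # mv $d, $s  ->  add $d, $0, $s
        d, s = args
        return f"add {d}, $0, {s}"
    if op == "ldi":                # ldi $d, imm  ->  addi $d, $0, imm
        d, imm = args
        return f"addi {d}, $0, {imm}"
    return instr                   # real instructions pass through unchanged

assert expand("mv $1, $2") == "add $1, $0, $2"
assert expand("ldi $1, 1") == "addi $1, $0, 1"
```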


Appendix C

Commented tunnel kernel

; The idea is to draw a green tunnel.
; This effect consists of 2 squares in different sizes and a X across the screen
; To make this happen assume we are on a pixel that should be drawn green
; We then make a series of checks to verify if we are on the correct pixel or not
; If we ever see that we are on the wrong pixel, we will set our mask bit high
; This will make it so that we draw black instead of green
;
; To make this happen we have to use a general purpose register as masking register
; This is because if the masking value gets set high, we don't want to overwrite
; it with low, so we need to remember it.

; First square
ldi $15, 0b111111        ; 63 bitmask
ldi $14, 64              ; Set width & height of screen into $14
and $7, $15, $id_lo      ; Load x value into $7
srl $8, $id_lo, 6        ; Load y value into $8

ldi $data, 0b0000011111100000 ; Set default color to green

mv $12, $0               ; shadow mask - used as temporary mask bit to
                         ; be able to read masking value

ldc $13, 10              ; Set offset from edges
sub $14, $14, $13        ; Set offset from opposite edges

; Draw LEFT line of box
seq $10, $7, $13         ; Checks that we are on correct x value

slt $11, $13, $8         ; Checks that we are inside our box from the top
and $10, $10, $11

slt $11, $8, $14         ; Checks that we are inside our box from the bottom
and $10, $10, $11

or $12, $12, $10         ; Adds result into masking bit


; Draw RIGHT line of box
seq $10, $7, $14         ; Same system as above, but checking the other
                         ; side of the box
slt $11, $13, $8
and $10, $10, $11

slt $11, $8, $14
and $10, $10, $11

or $12, $12, $10


; Draw UPPER line of box
seq $10, $8, $13         ; Instead of checking the vertical lines,
                         ; the horizontal lines are checked
slt $11, $13, $7
and $10, $10, $11

slt $11, $7, $14
and $10, $10, $11

or $12, $12, $10


; Draw LOWER line of box
seq $10, $8, $14         ; Same as above, but lower line instead of upper

slt $11, $13, $7
and $10, $10, $11

slt $11, $7, $14
and $10, $10, $11

or $12, $12, $10


; Next square
addi $14, $0, 64         ; Load screen width & height into $14
ldc $13, 11              ; Set new offset from edges
sub $14, $14, $13        ; Set offset from opposite edges

; Draw LEFT line of box
seq $10, $7, $13         ; Same check for the square as previous square

slt $11, $13, $8
and $10, $10, $11

slt $11, $8, $14
and $10, $10, $11

or $12, $12, $10

; Draw RIGHT line of box
seq $10, $7, $14

slt $11, $13, $8
and $10, $10, $11

slt $11, $8, $14
and $10, $10, $11

or $12, $12, $10

; Draw UPPER line of box
seq $10, $8, $13

slt $11, $13, $7
and $10, $10, $11

slt $11, $7, $14
and $10, $10, $11

or $12, $12, $10

; Draw LOWER line of box
seq $10, $8, $14

slt $11, $13, $7
and $10, $10, $11

slt $11, $7, $14
and $10, $10, $11

or $12, $12, $10


; Half cross
seq $10, $7, $8          ; if x = y then we are on \

or $12, $12, $10

; Other half cross
addi $13, $0, 0b111111   ; 63 bitmask
sub $14, $13, $8         ; negative y value, check /

seq $10, $14, $7

or $12, $12, $10

mv $mask, $12            ; Store temporary mask value into actual mask register

? ldi $data, 0x0000      ; Write black if not on either squares or on the cross

ldc $10, 5
add $address_lo, $10, $id_lo
sw                       ; save values

Appendix D

PCB schematics

[The original report reproduces the PCB schematic sheets here, dated 11/25/2014. The sheets are:

- FPGA programming interface (JTAG header and DONE/PROGRAM_B signals for the XC6SLX45-2CSG324I)
- FPGA Bank 0 - Video output (HDMI differential pairs and VGA I/O headers)
- FPGA Bank 1 - Memory interface (SRAM address/data buses and FPGA LEDs)
- FPGA Bank 2 - MCU connector (EBI address/data buses, SPI backup, GPIO headers, and the FPGA clock jumper)]

[Schematic: FPGA Bank 3 - Memory interface (FPGA_IO_4.SchDoc). XC6SLX45-2CSG324I connections to data memory 1 (ADDR[0..18], BUS[0..15], UB/LB, WRITE, ENABLE), plus the FPGA reset circuit.]

[Schematic: FPGA power (FPGA_VCC.SchDoc). VCCO and VCCAUX (3.3 V) and VCCINT (1.2 V) supply pins of the XC6SLX45-2CSG324I with their decoupling capacitors.]

[Schematic: FPGA GND (FPGA_GND.SchDoc). All ground pins of the XC6SLX45-2CSG324I tied to system ground.]

[Schematic: HDMI connector (HDMI.SchDoc). HDMI header and connector carrying the differential pairs D0, D1, D2, and CLK, plus the CEC, DDC (SCL/SDA), +5 V, and hot-plug detect signals.]

[Schematic: MCU (MCU.SchDoc). EFM32GG990F1024-BGA112 pin mapping for the EBI address/data buses, SPI, UART0, GPIO, USB data lines, VGA-I/O headers, status LEDs, reset, and the ARM programming header. Drawn by Julian Lam.]

[Schematic: MCU power (MCU_IO.SchDoc). EFM32GG990F1024 supply pins (IOVDD, AVDD, VDD_DREG, USB_VREGI/VREGO, VBUS) with their decoupling capacitors. Drawn by Julian Lam.]

[Schematic: Oscillator & XTALs (MCU_oscilliation.SchDoc). Low-frequency watch crystal and 48 MHz crystal for the MCU, and a 120 MHz oscillator for the FPGA clock, with load and bypass capacitors.]

[Schematic: Power supply (Power_supply.SchDoc). 5 V input regulated down to 3.3 V (ADP3339AKCZ-3.3) and 1.2 V (NCP566-1.2), with decoupling capacitors and indicator LEDs.]

[Schematic: Serial Port (Serial.SchDoc). MAX3232 level shifter between the MCU's UART0 and a 9-pin serial connector. Schematic note: pins 4, 5, 1, and 6 of the level shifter will supposedly be disabled; the current solution is to be revised.]

[Schematic: Memory (SRAM.SchDoc). AS7C38098A SRAM with jumper headers on the data bus (BUS[0..15]), address lines (ADDR[0..18]), and control signals (UB, LB, WRITE, ENABLE).]

[Schematic: LEDs (Status_LEDS.SchDoc). Two status LEDs each for the MCU and the FPGA, with 120 Ω series resistors.]

[Schematic: Top Level Schematics (Top_level.SchDoc). Interconnection of all sheets: power supply, USB power and data, serial port, MCU, the four FPGA I/O banks, the two SRAM data memories, HDMI, VGA-I/O, status LEDs, JTAG, and the oscillators.]

[Schematic: USB data (USB_data.SchDoc). USB connector for data with ESD protection and ferrite beads on the VBUS and D+/D- lines, and series resistors on the data pair.]

[Schematic: USB power (USB_power.SchDoc). USB connector supplying 5 V to the voltage regulators, with ferrite beads to ground. Drawn by Kristian F. Normann.]

[Schematic: VGA header (VGA.SchDoc), A4, dated 11/25/2014. Two 11-pin headers, J1 and J2, carry the VGA-I/O bus; a 2×2 header P8 and header P9 provide an emergency 3.3 V supply (at most 400 mA), and a 5 V line can be brought in by cable if needed.]
[PCB layout: component placement and pad assignments for the FPGA, MCU, SRAM data memories (MEMORY1/MEMORY2), EBI data and address buses, USB data and power connectors, VGA and HDMI headers, SPI and JTAG/programming headers, oscillators, voltage regulators, and decoupling capacitors.]