Microprocessors

Von Neumann architecture

The first computers used a single fixed program (like a numeric calculator).

To change the program, one had to re-wire, re-structure, or re-design the machine.

People who did this were not called computer programmers, as they are today, but rather computer architects.

A Von Neumann computer uses a single memory to hold both instructions and data.

The program is written in an appropriate language and is not hardwired into the computer itself; the computer is re-programmable.

In a Von Neumann computer, programs can be seen as data; as a consequence, a malfunctioning program can crash the computer. In a Von Neumann machine, an instruction is read from memory and decoded, the operand at the memory location the instruction asks for is fetched, the operation is performed, and the result is written back to memory.
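The fetch-decode-execute cycle just described can be sketched with a toy machine. This is a minimal illustration under stated assumptions, not a model of any real processor: the opcodes `LOAD`, `ADD`, `STORE` and `HALT`, the single accumulator and the function name `run` are all hypothetical. The Von Neumann property shown is that one list, `memory`, holds both the program and its data.

```python
# Toy Von Neumann machine: ONE memory list holds both the program and its data.
# The instruction set (LOAD/ADD/STORE/HALT) and the single accumulator are
# hypothetical, chosen only to illustrate the fetch-decode-execute cycle.

def run(memory):
    pc = 0   # program counter: address of the next instruction
    acc = 0  # accumulator register
    while True:
        instr = memory[pc]       # 1. fetch the instruction from memory
        op, addr = instr         # 2. decode it into opcode and operand address
        pc += 1
        if op == "HALT":
            return memory
        if op == "LOAD":
            acc = memory[addr]   # 3. fetch the operand from memory
        elif op == "ADD":
            acc += memory[addr]  # 3 + 4. fetch the operand and perform the operation
        elif op == "STORE":
            memory[addr] = acc   # 5. write the result back to memory
        else:
            raise ValueError(f"unknown opcode {op!r}")


memory = [
    ("LOAD", 4),   # address 0
    ("ADD", 5),    # address 1
    ("STORE", 6),  # address 2
    ("HALT", 0),   # address 3
    2,             # address 4: data
    3,             # address 5: data
    0,             # address 6: the result is stored here
]
run(memory)
print(memory[6])
```

Because the program lives in the same list as the data, a faulty `STORE` aimed at addresses 0 to 3 would overwrite the program itself, which is the kind of malfunction mentioned above.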

The term "von Neumann architecture" dates from June 1945 and is named after the mathematician John von Neumann, although such an architecture was not designed by von Neumann alone.

The Von Neumann bottleneck

The separation between the CPU and memory leads to what is known as the von Neumann bottleneck.

The throughput (data transfer rate) between the CPU and memory is very small in comparison with the amount of memory available and the rate at which the CPU can work.

As a result, the CPU is continuously forced to wait for data to be transferred to or from memory. Since CPU speed and memory size have increased much faster than the throughput between the two, the bottleneck has become more intense.

A cache memory between the CPU and main memory helps to alleviate the problem.
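The effect of a cache can be estimated with the standard average memory access time formula, hit time plus miss rate times miss penalty. The formula is a common textbook model rather than something stated in these notes, and the cycle counts below (1-cycle cache, 100-cycle main memory, 95 % hit rate) are hypothetical.

```python
# Average memory access time (AMAT) for a single cache level.
# The cycle counts below are hypothetical and do not describe a real machine.

def average_access_time(hit_time, miss_rate, miss_penalty):
    """Return the average number of cycles per memory access."""
    return hit_time + miss_rate * miss_penalty


MAIN_MEMORY_CYCLES = 100  # assumed cost of going all the way to main memory

with_cache = average_access_time(hit_time=1, miss_rate=0.05,
                                 miss_penalty=MAIN_MEMORY_CYCLES)
without_cache = MAIN_MEMORY_CYCLES  # every access goes to main memory

print(f"with cache: {with_cache:.1f} cycles, without: {without_cache} cycles")
```

Under these assumptions the average access drops from 100 cycles to about 6, although the raw CPU-to-memory throughput is unchanged.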

The Harvard architecture

In the Harvard architecture, there are separate storage and signal pathways for instructions and data.

In this architecture, the word width, timing, implementation technology, and memory address structure can differ for program and data. Instruction memory is often wider than data memory.

In some systems, instructions can be stored in read-only memory while data memory generally requires random-access memory. Typically, there is much more instruction memory than data memory, so instruction addresses are much wider than data addresses.

In a Von Neumann architecture, the CPU can either read an instruction or read/write data from/to memory, but both cannot occur at the same time, since instructions and data use the same signal pathways and memory.

A computer following the Harvard architecture can be faster because it is able to fetch the next instruction at the same time it completes the current instruction (a phenomenon known as pipelining). Speed is gained at the expense of more complex electrical circuitry.
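The difference can be sketched by counting memory-bus cycles under a deliberately simplified model. Assumptions (not from the notes): every memory access takes one cycle, each instruction needs one fetch plus at most one data access, and in the Harvard case the data access of one instruction overlaps the fetch of the next. The function names are hypothetical.

```python
# Count memory-bus cycles for a program under two simplified models.
# Each entry in `program` is True if that instruction also needs a data access.
# Assumption: every memory access (instruction fetch or data access) takes one cycle.

def cycles_von_neumann(program):
    # One shared memory and bus: the fetch and the data access of an
    # instruction must take turns.
    return sum(2 if needs_data else 1 for needs_data in program)


def cycles_harvard(program):
    # Separate instruction and data memories: while instruction i uses the
    # data bus, instruction i + 1 is fetched on the instruction bus in the
    # same cycle. Only the last instruction's data access needs an extra cycle.
    if not program:
        return 0
    return len(program) + (1 if program[-1] else 0)


# LOAD A, LOAD B, ADD, STORE C: three of the four instructions touch data memory.
program = [True, True, False, True]
print(cycles_von_neumann(program), cycles_harvard(program))
```

In this model the four instructions need 7 bus cycles on a shared memory and 5 with separate memories; the price, as noted above, is duplicated buses and more complex circuitry.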

Modern high-performance CPU chip designs incorporate aspects of both Harvard and von Neumann architectures. On-chip cache memory is divided into an instruction cache and a data cache.

Complex instruction set computer–CISC

A complex instruction set computer (CISC) is an instruction set architecture in which each instruction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store.

The terms register-memory and memory-memory architecture also refer to this concept.

In the early days of computers, compilers did not exist. Programming was done in either machine code or assembly language. To make programming easier, computer architects created more and more complex instructions, which were direct representations of high-level functions of high-level programming languages. The attitude at the time was that hardware design was easier than compiler design, so the complexity went into the hardware.

Another force that encouraged complexity was the lack of large memories. Because every byte of memory was precious (for example, an entire system might have only a few kilobytes of storage), the industry moved to features such as highly encoded instructions, variable-length instructions, instructions that performed multiple operations, and instructions that did both data movement and data calculation.

For the above reasons, CPU designers tried to make instructions that would do as much work as possible. This led to instructions that do all of the work at once: load the two numbers to be added, add them, and store the result directly back to memory.

The compact nature of CISC results in smaller program sizes and fewer accesses to main memory. While many designs achieved the aim of higher throughput at lower cost, and also allowed high-level language constructs to be expressed with fewer instructions, it was observed that programs did not take advantage of this. This is the point of departure from CISC to RISC.

Examples of CISC processors include Intel's x86 CPUs and the Intel 8051 microcontroller.

The terms RISC and CISC have become less meaningful with the continued evolution of both CISC and RISC designs and implementations.

Reduced instruction set computer–RISC

The reduced instruction set computer, or RISC, is a CPU design philosophy that favors a reduced and simpler instruction set.

The term "load-store architecture" also applies to this concept.

The idea was originally inspired by the discovery that many of the features included in traditional CPU designs (i.e., CISC) to facilitate coding were being ignored by programs and programmers.

In the late 1970s, researchers demonstrated that most of the many addressing modes present in CISC designs were ignored by most programs. This was a side effect of the increasing use of compilers to generate programs, as opposed to writing them in assembly language. In other words, compilers were not able to exploit the features of a CISC instruction set.

At about the same time, CPUs started to run even faster than the memory they talked to. It became apparent that more registers (and later caches) would be needed to support these higher operating frequencies. These additional registers and cache memories would require sizeable chip or board area, which could be made available if the complexity of the CPU was reduced.

Since real-world programs spent most of their time executing very simple operations, some researchers decided to focus on making those common operations as simple and as fast as possible. The goal of RISC was to make instructions so simple that each one could be executed in a single clock cycle.

However, RISC also has its drawbacks. Since a series of instructions is needed to complete even simple tasks, the total number of instructions read from memory is larger, and fetching them therefore takes longer (see the Von Neumann bottleneck).
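This trade-off can be expressed with the common "iron law" of processor performance: execution time equals instruction count times cycles per instruction times clock period. The formula is a standard textbook model rather than part of these notes, and every number below (a 500 ps clock, a 6-cycle CISC `ADD A, B, C`, a one-cycle stall per fetch on slow memory) is a hypothetical value chosen only to show that the outcome depends on the figures.

```python
# "Iron law" of processor performance:
#   execution time = instruction count x cycles per instruction x clock period
# All figures below are hypothetical and chosen only to show the trade-off.

def execution_time_ps(instruction_count, cycles_per_instruction, clock_period_ps):
    return instruction_count * cycles_per_instruction * clock_period_ps


CLOCK_PS = 500  # assumed clock period for both designs

# One complex memory-to-memory ADD A, B, C that takes 6 cycles:
cisc = execution_time_ps(1, 6, CLOCK_PS)
# LOAD, LOAD, ADD, STORE, each completing in a single cycle:
risc = execution_time_ps(4, 1, CLOCK_PS)
# The same RISC code if every instruction fetch stalls one extra cycle on slow memory:
risc_slow_fetch = execution_time_ps(4, 2, CLOCK_PS)

print(cisc, risc, risc_slow_fetch)
```

With fast instruction fetch the simple instructions win (2000 ps against 3000 ps), but once the extra fetches stall on memory the same RISC sequence takes 4000 ps, which is the bottleneck effect described above.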

In the early 1980s, it was thought that existing designs were reaching theoretical limits. Future improvements in speed would come primarily from improved semiconductor "process", that is, smaller features (transistors and wires) on the chip. The complexity of the chip would remain largely the same, but the smaller size would allow it to run at higher clock rates (Moore's law).

The CDC 6600 supercomputer, an early RISC-style design by Jim Thornton and Seymour Cray (1964), had 74 opcodes, while Intel's 8086 has 400.

RISC designs have led to a number of successful platforms and architectures, some of the larger ones being the PlayStation, PlayStation 2, PlayStation Portable and PlayStation 3, the Nintendo 64, Nintendo's GameCube and Wii game consoles, Microsoft's Xbox 360, and Palm PDAs.

An instruction is carried out as a sequence of micro-instructions (steps). In a pipelined processor, the processor works on one micro-instruction of each of several different instructions at the same time.

For example, the classic RISC pipeline is broken into five stages:

1. Instruction fetch

2. Instruction decode and register fetch

3. Execute

4. Memory access

5. Register write back
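The overlap of these five stages can be visualised by generating a pipeline chart. This sketch assumes an ideal pipeline (one new instruction enters every cycle, with no stalls, hazards or branches); the labels `IF`, `ID`, `EX`, `MEM` and `WB` are conventional abbreviations for the five stages listed above, and `pipeline_chart` is a hypothetical helper.

```python
# Conventional abbreviations for the five stages listed above.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(n_instructions, stages=STAGES):
    """Return one row per instruction showing its stage in each clock cycle.

    Assumes an ideal pipeline: one new instruction enters per cycle and there
    are no stalls, hazards or branches.
    """
    n_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = []
        for cycle in range(n_cycles):
            stage = cycle - i  # instruction i enters the first stage at cycle i
            row.append(stages[stage] if 0 <= stage < len(stages) else "")
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_chart(3)):
    print(f"I{i + 1}: " + " ".join(f"{s:>3}" for s in row))
```

Reading any column of the chart shows several instructions in different stages during the same clock cycle.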

The key to pipelining is the observation that the processor can start reading the next instruction as soon as it finishes reading the previous one, meaning that it works on two instructions simultaneously: one is being decoded while the next one is being read (two-stage pipelining).

While no single instruction is completed any faster, each following instruction completes right after the previous one. The result is a much more efficient use of processor resources. Pipelining reduces the cycle time of a processor and hence increases instruction throughput, the number of instructions that can be executed per unit of time.
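The throughput gain can be checked with a short cycle count. This sketch assumes every stage takes exactly one clock cycle and the pipeline never stalls; the function names are hypothetical.

```python
# Cycles needed to run n instructions on a k-stage machine, assuming every
# stage takes exactly one clock cycle and the pipeline never stalls.

def cycles_without_pipeline(n, k):
    # Each instruction passes through all k stages before the next one starts.
    return n * k


def cycles_with_pipeline(n, k):
    # The first instruction needs k cycles to fill the pipeline; after that,
    # one instruction completes every cycle.
    return k + n - 1 if n > 0 else 0


n, k = 100, 5
speedup = cycles_without_pipeline(n, k) / cycles_with_pipeline(n, k)
print(cycles_without_pipeline(n, k), cycles_with_pipeline(n, k), round(speedup, 2))
```

Each individual instruction still needs k cycles, yet 100 instructions finish in 104 cycles instead of 500; as n grows, the speedup approaches the number of stages.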

A typical CISC instruction to add two numbers might be ADD A, B, C, which adds the values found in memory locations A and B and then puts the result in memory location C. In a pipelined processor, the pipeline controller would break this into a series of instructions similar to:

    LOAD A, R1
    LOAD B, R2
    ADD R1, R2, R3
    STORE R3, C
    LOAD next instruction

The R locations are registers, temporary memory inside the CPU that is quick to access. The end result is the same: the numbers are added and the result is placed in C. The time taken to drive this one addition to completion is no different from the non-pipelined case (and possibly greater than for the single CISC instruction).
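That both versions produce the same result can be sketched by executing them on a dictionary that stands in for memory. The helpers `cisc_add` and `risc_add` are hypothetical; the load/store sequence and its operand order (destination last) follow the listing above, and the dictionary `regs` stands in for the CPU registers.

```python
def cisc_add(mem, a, b, c):
    # One memory-to-memory instruction: ADD A, B, C
    mem[c] = mem[a] + mem[b]


def risc_add(mem, a, b, c):
    # The same work as a load/store sequence; returns how many instructions ran.
    regs = {}  # stands in for the CPU registers R1..R3
    program = [
        ("LOAD", a, "R1"),
        ("LOAD", b, "R2"),
        ("ADD", "R1", "R2", "R3"),
        ("STORE", "R3", c),
    ]
    for op, *args in program:
        if op == "LOAD":
            addr, reg = args
            regs[reg] = mem[addr]        # memory -> register
        elif op == "ADD":
            src1, src2, dst = args
            regs[dst] = regs[src1] + regs[src2]  # register-only arithmetic
        elif op == "STORE":
            reg, addr = args
            mem[addr] = regs[reg]        # register -> memory
    return len(program)


mem_cisc = {"A": 2, "B": 3, "C": 0}
mem_risc = {"A": 2, "B": 3, "C": 0}
cisc_add(mem_cisc, "A", "B", "C")
steps = risc_add(mem_risc, "A", "B", "C")
print(mem_cisc, mem_risc, steps)
```

Only the LOAD and STORE steps touch memory; the ADD works on registers alone, which is what lets a pipeline overlap it with memory accesses for the next instruction.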

The key to understanding the advantage of pipelining is to consider what happens when this ADD function is "half-way done", at the ADD instruction for instance. At this point, the circuitry responsible for loading data from memory is no longer being used and would normally sit idle. Instead, the pipeline controller fetches the next instruction from memory and starts loading the data it needs into registers. That way, when the ADD instruction is complete, the data needed for the next ADD is already loaded and ready to go. The overall effective speed of the machine can be greatly increased because fewer parts of the CPU sit idle.

Practically every microprocessor manufactured today uses at least two pipeline stages (the Atmel AVR and the PIC, for example, each have a two-stage pipeline).

Advantages of pipelining: the cycle time of the processor is reduced, thus increasing instruction bandwidth in most cases.

Advantages of not pipelining:

• The processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently, the design is simpler and cheaper to manufacture.

• The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent, because extra flip-flops must be added to the data path of a pipelined processor.

• A non-pipelined processor will have a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
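The first advantage of pipelining and the latency point in the list above can be illustrated with a simple timing model. All delays are hypothetical (1000 ps of logic, 20 ps per layer of flip-flops), and the model assumes the logic can be split into k perfectly equal stages, which real designs rarely achieve.

```python
# Simplified timing model (all values hypothetical, in picoseconds):
#   logic_delay   total combinational delay of the whole data path
#   reg_overhead  delay added by each layer of flip-flops
# Assumes the logic can be split into k perfectly equal stages.

def non_pipelined(logic_delay, reg_overhead):
    cycle = logic_delay + reg_overhead  # one register at the end of the path
    return {"cycle": cycle, "latency": cycle}


def pipelined(logic_delay, reg_overhead, k):
    cycle = logic_delay / k + reg_overhead  # one stage plus its flip-flops
    return {"cycle": cycle, "latency": k * cycle}


print(non_pipelined(1000, 20))
print(pipelined(1000, 20, 5))
```

Under these assumptions the five-stage version shortens the cycle from 1020 ps to 220 ps, so throughput rises, while the latency of a single instruction grows from 1020 ps to 1100 ps because of the extra flip-flops.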

Many designs include pipelines as long as 7, 10 and even 31 stages (as in the Intel Pentium 4). The downside of a long pipeline is that when a program branches, the entire pipeline may have to be flushed, a problem that branch prediction helps to alleviate.
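The cost of flushing can be estimated as an effective number of cycles per instruction (CPI): an ideal CPI of 1 plus the average number of cycles lost to flushes. This is a common simplified model, not one given in these notes, and the branch frequency, misprediction rate and penalties below are hypothetical.

```python
# Effective cycles per instruction (CPI) of an otherwise ideal pipeline that
# loses `penalty` cycles every time it has to flush after a branch.
# All rates and penalties below are hypothetical.

def effective_cpi(branch_fraction, flush_rate, penalty):
    # Base CPI of 1, plus the average number of cycles lost per instruction.
    return 1 + branch_fraction * flush_rate * penalty


# Without prediction the pipeline waits on every branch (flush_rate = 1.0).
no_prediction = effective_cpi(branch_fraction=0.2, flush_rate=1.0, penalty=4)
# With prediction, only mispredicted branches (assumed 10 %) cause a flush.
with_prediction = effective_cpi(branch_fraction=0.2, flush_rate=0.1, penalty=4)
# A much deeper pipeline (assumed 30-cycle penalty) loses more per flush.
deep_with_prediction = effective_cpi(branch_fraction=0.2, flush_rate=0.1, penalty=30)

print(no_prediction, with_prediction, deep_with_prediction)
```

In the extreme case where every instruction is a branch that empties a five-stage pipeline (a penalty of four cycles), `effective_cpi(1.0, 1.0, 4)` gives 5, the same as an unpipelined five-stage machine.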

The higher throughput of pipelines falls short when the executed code contains many branches: the processor cannot know where to read the next instruction and must wait for the branch instruction to finish, leaving the pipeline behind it empty. After the branch is resolved, the next instruction has to travel all the way through the pipeline before its result becomes available and the processor appears to "work" again. In the extreme case, the performance of a pipelined processor could theoretically approach that of an unpipelined processor, or even be slightly worse if all but one of the pipeline stages are idle and a small overhead is present between stages.
